SPECIFICATION



A SPEECH COMMUNICATION SYSTEM AND 
METHOD FOR HANDLING LOST FRAMES 



INCORPORATION BY REFERENCE 

The following U.S. Patent Applications are hereby incorporated by reference in their 
entireties and made part of the present application: 

U.S. Patent Application Serial No. 09/156,650, titled "Speech Encoder Using Gain
Normalization That Combines Open And Closed Loop Gains," Conexant Docket No.
98RSS399, filed September 18, 1998;

Provisional U.S. Patent Application Serial No. 60/155,321 titled "4 kbits/s Speech 
Coding," Conexant Docket No. 99RSS485, filed September 22, 1999; and 

U.S. Patent Application Serial No. 09/574,396 titled "A New Speech Gain
Quantization Strategy," Conexant Docket No. 99RSS312, filed May 19, 2000.



BACKGROUND OF THE INVENTION 

The field of the present invention relates generally to the encoding and decoding of
speech in voice communication systems and, more particularly, to a method and apparatus for
handling erroneous or lost frames.

To model basic speech sounds, speech signals are sampled over time and stored in
frames as a discrete waveform to be digitally processed. However, to use the communication
bandwidth for speech more efficiently, speech is coded before being transmitted, especially
when speech is intended to be transmitted under limited bandwidth constraints. Numerous
algorithms have been proposed for the various aspects of speech coding. For example, an
analysis-by-synthesis coding approach may be performed on a speech signal. In coding
speech, the speech coding algorithm tries to represent characteristics of the speech signal
in a manner which requires less bandwidth. For example, the speech coding algorithm seeks
to remove redundancies in the speech signal. A first step is to remove short-term
correlations. One type of signal coding technique is linear predictive coding (LPC). In using
an LPC approach, the speech signal value at any particular time is modeled as a linear
function of previous values. By using an LPC approach, short-term correlations can be
reduced and efficient speech signal representations can be determined by estimating and
applying certain prediction parameters to represent the signal. The LPC spectrum, which is
an envelope of short-term correlations in the speech signal, may be represented, for
example, by LSFs (line spectral frequencies). After the removal of short-term correlations
in a speech signal, an LPC residual signal remains. This residual signal contains
periodicity information that needs to be modeled. The second step in removing
redundancies in speech is to model the periodicity information. Periodicity information may
be modeled by using pitch prediction. Certain portions of speech have periodicity while
other portions do not. For example, the sound "aah" has periodicity information while the
sound "shhh" has no periodicity information.
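
The linear-prediction step described above can be sketched in a few lines. This is a minimal illustration, not the codec's own analysis: it assumes the textbook autocorrelation method with the Levinson-Durbin recursion (the specification does not say how the prediction parameters are estimated), uses order 2 instead of the tenth-order model described later, and the helper names (`autocorrelation`, `levinson_durbin`, `lpc_residual`) are hypothetical.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of frame x (samples outside the frame are zero)."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for predictor coefficients a[0..order-1] such that
    s[n] is approximated by sum_k a[k] * s[n-1-k].
    Returns the coefficients and the final prediction-error energy."""
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] hold the predictor
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err

def lpc_residual(x, coeffs):
    """Residual e[n] = s[n] - sum_k a[k] * s[n-1-k], treating samples before the frame as zero."""
    residual = []
    for n in range(len(x)):
        prediction = sum(c * x[n - 1 - k]
                         for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        residual.append(x[n] - prediction)
    return residual

def energy(v):
    return sum(s * s for s in v)

# A pure tone is almost perfectly predictable from its previous two samples,
# so the residual carries only a small fraction of the frame energy.
frame = [math.sin(0.3 * n) for n in range(400)]
coeffs, _ = levinson_durbin(autocorrelation(frame, 2), order=2)
residual = lpc_residual(frame, coeffs)
print(coeffs, energy(residual) / energy(frame))
```

A real encoder would also window the frame and convert the coefficients to LSFs before quantization; those steps are omitted here.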

In applying the LPC technique, a conventional source encoder operates on speech 
signals to extract modeling and parameter information to be coded for communication to a 
conventional source decoder via a communication channel. One way to code modeling and 
parameter information into a smaller amount of information is to use quantization. 
Quantization of a parameter involves selecting the closest entry in a table or codebook to 
represent the parameter. Thus, for example, a parameter of 0.125 may be represented by 0.1 
if the codebook contains 0, 0.1, 0.2, 0.3, etc. Quantization includes scalar quantization and 
vector quantization. In scalar quantization, one selects the entry in the table or codebook that 
is the closest approximation to the parameter, as described above. By contrast, vector 
quantization combines two or more parameters and selects the entry in the table or codebook 
which is closest to the combined parameters. For example, vector quantization may select 
the entry in the codebook that is the closest to the difference between the parameters. A
codebook used to vector quantize two parameters at once is often referred to as a two-
dimensional codebook. An n-dimensional codebook quantizes n parameters at once.

Quantized parameters may be packaged into packets of data which are transmitted 
from the encoder to the decoder. In other words, once coded, the parameters representing the 
input speech signal are transmitted to a transceiver. Thus, for example, the LSFs may be
quantized and the index into a codebook may be converted into bits and transmitted from the 
encoder to the decoder. Depending on the embodiment, each packet may represent a portion 
of a frame of the speech signal, a frame of speech, or more than a frame of speech. At the 
transceiver, a decoder receives the coded information. Because the decoder is configured to 
know the manner in which speech signals are encoded, the decoder decodes the coded 
information to reconstruct a signal for playback that sounds to the human ear like the original 
speech. However, it is often unavoidable that at least one packet of data is lost during
transmission, so that the decoder does not receive all of the information sent by the encoder. For
instance, when speech is being transmitted from a cell phone to another cell phone, data may 
be lost when reception is poor or noisy. Therefore, transmitting the coded modeling and 
parameter information to the decoder requires a way for the decoder to correct or adjust for 
lost packets of data. While the prior art describes certain ways of adjusting for lost packets
of data, such as extrapolation to guess what information the lost packet contained,
these methods are limited, and improved methods are needed.

Besides LSF information, other parameters transmitted to the decoder may be lost. In 
CELP (Code Excited Linear Prediction) speech coding, for example, there are two types of 
gain which are also quantized and transmitted to the decoder. The first type of gain is the 
pitch gain G_p, also known as the adaptive codebook gain. The adaptive codebook gain is
sometimes referred to, including herein, with the subscript "a" instead of the subscript "p".
The second type of gain is the fixed codebook gain G_c. Speech coding algorithms have
quantized parameters including the adaptive codebook gain and the fixed codebook gain.
Other parameters may, for example, include pitch lags which represent the periodicity of 
voiced speech. If the speech encoder classifies speech signals, the classification information 
about the speech signal may also be transmitted to the decoder. For an improved speech 
encoder/decoder that classifies speech and operates in different modes, see U.S. Patent 
Application Serial No. 09/574,396 titled "A New Speech Gain Quantization Strategy," 
Conexant Docket No. 99RSS312, filed May 19, 2000, which was previously incorporated
herein by reference. 

Because these and other parameters are sent over imperfect transmission
means to the decoder, some of them are lost or never received by the decoder.
For speech communication systems that transmit a packet of information per frame of 
speech, a lost packet results in a lost frame of information. In order to reconstruct or estimate 
the lost information, prior art systems have tried different approaches, depending on the 
parameter lost. Some approaches simply use the parameter from the previous frame that 
actually was received by the decoder. These prior art approaches have their disadvantages, 
inaccuracies and problems. Thus, there is a need for an improved way to correct or adjust for 
lost information so as to recreate a speech signal as close as possible to the original speech 
signal. 

Certain prior art speech communication systems do not transmit a fixed codebook 
excitation from the encoder to the decoder in order to save bandwidth. Instead, these systems 
have a local Gaussian time series generator that uses an initial fixed seed to generate a 
random excitation value and then updates that seed every time the system encounters a frame 
containing silence or background noise. Thus, the seed changes for every noise frame. 
Because the encoder and decoder have the same Gaussian time series generator that uses the 
same seeds in the same sequence, they generate the same random excitation value for noise 
frames. However, if a noise frame is lost and not received by the decoder, the encoder and 
decoder use different seeds for the same noise frame, thereby losing their synchronicity.
Thus, there is a need for a speech communication system that does not transmit fixed 
codebook excitation values to the decoder, but which maintains synchronicity between the 
encoder and decoder when a frame is lost during transmission. 
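
The loss of synchronicity in the prior-art scheme can be reproduced with a small simulation. Everything here is hypothetical: the initial seed, the linear congruential seed update, and the use of Python's `random.Random(seed).gauss` as the Gaussian time series generator stand in for whatever generator a real system would use. The only property modeled is that the seed advances once per noise frame actually processed.

```python
import random

INITIAL_SEED = 12345  # hypothetical fixed initial seed shared by encoder and decoder

def next_seed(seed):
    """Hypothetical seed update: one step of a 31-bit linear congruential generator."""
    return (seed * 1103515245 + 12345) & 0x7FFFFFFF

def noise_excitation(seed, length=4):
    """Gaussian excitation drawn from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def excitations_for(received_flags):
    """Replay the prior-art scheme over a run of noise frames.

    received_flags[i] is True if noise frame i was seen. The seed only advances
    on frames that are actually processed, so a lost frame is never counted.
    """
    seed = INITIAL_SEED
    excitations = {}
    for frame, received in enumerate(received_flags):
        if received:
            excitations[frame] = noise_excitation(seed)
            seed = next_seed(seed)
    return excitations

encoder = excitations_for([True, True, True, True])
decoder = excitations_for([True, False, True, True])  # noise frame 1 lost
```

With noise frame 1 lost, the decoder generates frame 2 from the seed the encoder used for frame 1, so every later noise frame is off by one.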


SUMMARY OF THE INVENTION 

Various separate aspects of the present invention can be found in a speech 
communication system and method that has an improved way of handling information lost 
during transmission from the encoder to the decoder. In particular, the improved speech
communication system is able to generate more accurate estimates for the information lost in
a lost packet of data. For example, the improved speech communication system is able to
more accurately handle lost information such as LSF, pitch lag (or adaptive codebook
excitation), fixed codebook excitation and/or gain information. In an embodiment of a
speech communication system that does not transmit fixed codebook excitation values to the
decoder, the improved encoder/decoder are able to generate the same random excitation
values for a given noise frame even if a previous noise frame was lost during transmission.

A first, separate aspect of the present invention is a speech communication system
that handles lost LSF information by setting the minimum spacing between LSFs to an
increased value and then decreasing the value for subsequent frames in a controlled adaptive
manner.

A second, separate aspect of the present invention is a speech communication system 
that estimates a lost pitch lag by extrapolating from the pitch lags of a plurality of the 
preceding received frames. 
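
A minimal sketch of the second aspect, assuming a least-squares straight line through the pitch lags of the last few received frames. The text does not specify the extrapolation rule or how many frames are used, so both are assumptions, and `extrapolate_pitch_lag` is a hypothetical name.

```python
def extrapolate_pitch_lag(previous_lags):
    """Estimate the lost frame's pitch lag from the lags of several preceding
    received frames (oldest first) by fitting a least-squares line and
    evaluating it one frame ahead."""
    n = len(previous_lags)
    if n == 0:
        raise ValueError("need at least one received pitch lag")
    if n == 1:
        return float(previous_lags[0])  # nothing to fit: repeat the last lag
    mean_x = (n - 1) / 2
    mean_y = sum(previous_lags) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(previous_lags))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    return mean_y + slope * (n - mean_x)

estimate = extrapolate_pitch_lag([40, 42, 44, 46])
```

For lags of 40, 42, 44 and 46 samples the fitted line continues to 48; a steady lag history simply repeats.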

A third, separate aspect of the present invention is a speech communication system
that receives the pitch lag of the succeeding received frame and uses curve fitting between
the pitch lag of the preceding received frame and the pitch lag of the succeeding received
frame to fine tune its estimation of the pitch lag for the lost frame so as to adjust or correct
the adaptive codebook buffer prior to its use by subsequent frames.
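
For the third aspect, the simplest curve through the two known points is a straight line. The sketch below assumes that choice (the text says only "curve fitting") and spreads the lags of one or more lost frames evenly between the lag received before the gap and the lag received after it. The difference between this value and an earlier extrapolated estimate is what the decoder would use to correct its adaptive codebook buffer; that correction step is not shown, and the function name is hypothetical.

```python
def interpolate_lost_lags(prev_lag, next_lag, num_lost=1):
    """Place the pitch lags of num_lost consecutive lost frames on a straight
    line between the last lag received before the gap and the first lag
    received after it."""
    step = (next_lag - prev_lag) / (num_lost + 1)
    return [prev_lag + step * (i + 1) for i in range(num_lost)]

one_lost = interpolate_lost_lags(40, 46)
two_lost = interpolate_lost_lags(40, 46, num_lost=2)
```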
A fourth, separate aspect of the present invention is a speech communication system
that estimates a lost gain parameter for periodic-like speech differently than it estimates a lost
gain parameter for non-periodic like speech.

A fifth, separate aspect of the present invention is a speech communication system 
that estimates a lost adaptive codebook gain parameter differently than it estimates a lost
fixed codebook gain parameter.

A sixth, separate aspect of the present invention is a speech communication system 
that determines a lost adaptive codebook gain parameter for a lost frame of non-periodic like 
speech based on the average adaptive codebook gain parameter of the subframes of an 
adaptive number of previously received frames. 
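
A sketch of the bookkeeping behind the sixth aspect: the decoder keeps the adaptive codebook gains of the subframes of recently received frames and averages them over a chosen number of frames. The class name, the history depth of eight frames, and leaving the choice of `num_frames` to the caller are assumptions; the text does not say how the number of frames is adapted.

```python
from collections import deque

class AdaptiveGainHistory:
    """Keeps the adaptive codebook gains of the subframes of recently received frames."""

    def __init__(self, max_frames=8):  # hypothetical history depth
        self.frames = deque(maxlen=max_frames)

    def add_received_frame(self, subframe_gains):
        self.frames.append(list(subframe_gains))

    def average_gain(self, num_frames):
        """Average subframe gain over the last num_frames received frames.
        How num_frames is adapted is not specified; the caller chooses it."""
        if num_frames <= 0:
            raise ValueError("num_frames must be positive")
        recent = list(self.frames)[-num_frames:]
        gains = [g for frame in recent for g in frame]
        if not gains:
            raise ValueError("no received frames in history")
        return sum(gains) / len(gains)

history = AdaptiveGainHistory()
history.add_received_frame([0.5, 0.7])  # Mode 0 frame: two subframes
history.add_received_frame([0.9, 0.9])
history.add_received_frame([0.2, 0.4])
estimate = history.average_gain(num_frames=2)
```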

A seventh, separate aspect of the present invention is a speech communication system
that determines a lost adaptive codebook gain parameter for a lost frame of non-periodic like
speech based on the average adaptive codebook gain parameter of the subframes of an 
adaptive number of previously received frames and the ratio of the adaptive codebook 
excitation energy to the total excitation energy. 

An eighth, separate aspect of the present invention is a speech communication system
that determines a lost adaptive codebook gain parameter for a lost frame of non-periodic like
speech based on the average adaptive codebook gain parameter of the subframes of an
adaptive number of previously received frames, the ratio of the adaptive codebook excitation
energy to the total excitation energy, the spectral tilt of the previously received frame and/or
the energy of the previously received frame.

A ninth, separate aspect of the present invention is a speech communication system 
that sets a lost adaptive codebook gain parameter for a lost frame of non-periodic like speech
to an arbitrarily high number.

A tenth, separate aspect of the present invention is a speech communication system 
that sets a lost fixed codebook gain parameter to zero for all subframes of a lost frame of 
non-periodic like speech. 

An eleventh, separate aspect of the present invention is a speech communication
system that determines a lost fixed codebook gain parameter for the current subframe of the
lost frame of non-periodic like speech based on the ratio of the energy of the previously 
received frame to the energy of the lost frame. 

A twelfth, separate aspect of the present invention is a speech communication system 
that determines a lost fixed codebook gain parameter for the current subframe of the lost
frame based on the ratio of the energy of the previously received frame to the energy of the
lost frame and then attenuates that parameter to set the lost fixed codebook gain parameters 
for the remaining subframes of the lost frame. 
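
A sketch of the eleventh and twelfth aspects, under two stated assumptions: the energy ratio is turned into a gain by taking its square root (so that scaling a unit-gain excitation restores the earlier energy), and later subframes are attenuated by a fixed hypothetical factor of 0.9. Neither the mapping nor the factor is given in the text, and both function names are hypothetical.

```python
import math

def first_subframe_gain(prev_frame_energy, lost_frame_energy):
    """Hypothetical mapping from the energy ratio to a fixed codebook gain:
    a gain whose square equals the ratio of the previously received frame's
    energy to the lost frame's (unit-gain) energy."""
    if lost_frame_energy <= 0.0:
        return 0.0
    return math.sqrt(prev_frame_energy / lost_frame_energy)

def lost_fixed_codebook_gains(first_gain, num_subframes, attenuation=0.9):
    """First subframe gets first_gain; each later subframe is attenuated further."""
    gains = [first_gain]
    for _ in range(num_subframes - 1):
        gains.append(gains[-1] * attenuation)
    return gains

g0 = first_subframe_gain(4.0, 1.0)
gains = lost_fixed_codebook_gains(g0, num_subframes=3, attenuation=0.5)
```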

A thirteenth, separate aspect of the present invention is a speech communication 
system that sets a lost adaptive codebook gain parameter for the first frame of periodic like
speech to be lost after a received frame to an arbitrarily high number.

A fourteenth, separate aspect of the present invention is a speech communication 
system that sets a lost adaptive codebook gain parameter for the first frame of periodic like 
speech to be lost after a received frame to an arbitrarily high number and then attenuates that
parameter to set the lost adaptive codebook gain parameters for the remaining subframes of
the lost frame.

A fifteenth, separate aspect of the present invention is a speech communication 
system that sets a lost fixed codebook gain parameter for a lost frame of periodic like speech
to zero if the average adaptive codebook gain parameter of a plurality of the previously
received frames exceeds a threshold. 

A sixteenth, separate aspect of the present invention is a speech communication 
system that determines a lost fixed codebook gain parameter for the current subframe of a 
lost frame of periodic like speech based on the ratio of the energy of the previously received
frame to the energy of the lost frame if the average adaptive codebook gain parameter of a
plurality of the previously received frames does not exceed a threshold.
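
The fifteenth and sixteenth aspects together amount to a threshold test on the recent average adaptive codebook gain. The sketch below assumes a hypothetical threshold of 0.8 and a square-root mapping from the energy ratio to a gain; both are illustrative choices, not values from the text.

```python
import math

def lost_fixed_gain_periodic(avg_adaptive_gain, prev_frame_energy,
                             lost_frame_energy, threshold=0.8):
    """Fixed codebook gain for the current subframe of a lost periodic-like frame.
    threshold=0.8 and the square-root energy mapping are hypothetical."""
    if avg_adaptive_gain > threshold:
        # Strongly periodic history: rely on the adaptive codebook alone.
        return 0.0
    if lost_frame_energy <= 0.0:
        return 0.0
    return math.sqrt(prev_frame_energy / lost_frame_energy)

strongly_periodic = lost_fixed_gain_periodic(0.95, 4.0, 1.0)
weakly_periodic = lost_fixed_gain_periodic(0.5, 4.0, 1.0)
```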

A seventeenth, separate aspect of the present invention is a speech communication 
system that determines a lost fixed codebook gain parameter for the current subframe of a
lost frame based on the ratio of the energy of the previously received frame to the energy of
the lost frame and then attenuates that parameter to set the lost fixed codebook gain
parameters for the remaining subframes of the lost frame if the average adaptive codebook
gain parameter of a plurality of the previously received frames exceeds a threshold.

An eighteenth, separate aspect of the present invention is a speech communication 
system that randomly generates a fixed codebook excitation for a given frame by using a
seed whose value is determined by information in that frame.
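
One way to avoid the prior-art seed drift described in the Background is sketched below, under the assumption that the seed is computed from the bits transmitted in the frame itself. The folding hash, the byte payloads, and the function names are hypothetical, and `random.Random(seed).gauss` stands in for the Gaussian generator. Because no state carries over from earlier frames, losing one frame cannot desynchronize the next.

```python
import random

def seed_from_frame(frame_bits):
    """Hypothetical seed derivation: fold the frame's transmitted bytes into 31 bits."""
    seed = 0
    for byte in frame_bits:
        seed = (seed * 31 + byte) & 0x7FFFFFFF
    return seed

def noise_excitation(frame_bits, length=4):
    """Gaussian excitation whose seed depends only on this frame's bits."""
    rng = random.Random(seed_from_frame(frame_bits))
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

# Hypothetical payloads for three consecutive noise frames.
frames = [bytes([0x12, 0x34]), bytes([0x56, 0x78]), bytes([0x9A, 0xBC])]
encoder = [noise_excitation(f) for f in frames]

# The decoder never receives frame 1, yet its frame 2 still matches the encoder.
decoder_frame2 = noise_excitation(frames[2])
```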

A nineteenth, separate aspect of the present invention is a speech communication 
decoder that after estimating lost parameters in a lost frame and synthesizing the speech, 
matches the energy of the synthesized speech to the energy of the previously received frame. 
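
Energy matching as in the nineteenth aspect reduces to a single scale factor, the square root of the ratio of the reference energy to the energy of the synthesized frame. The sketch assumes the energy is a plain sum of squared samples over the frame; the sample values and the function name are hypothetical.

```python
import math

def match_energy(synthesized, reference_energy):
    """Scale synthesized samples so their energy equals reference_energy."""
    synth_energy = sum(s * s for s in synthesized)
    if synth_energy == 0.0:
        return list(synthesized)  # silent frame: nothing to scale
    scale = math.sqrt(reference_energy / synth_energy)
    return [s * scale for s in synthesized]

previous_frame = [0.5, -1.5, 2.0, 1.0]  # hypothetical received frame
concealed = [1.0, -1.0, 2.0, 0.0]       # hypothetical synthesized lost frame
matched = match_energy(concealed, sum(s * s for s in previous_frame))
```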




WO 02/07061 



PCT/IB01/01228



A twentieth, separate aspect of the present invention is any of the above separate 
aspects, either individually or in some combination. 

Further separate aspects of the present invention can also be found in a method of 
encoding and/or decoding a speech signal that practices any of the above separate aspects, 
either individually or in some combination. 

Other aspects, advantages and novel features of the present invention will become 
apparent from the following Detailed Description Of A Preferred Embodiment, when 
considered in conjunction with the accompanying figures. 

BRIEF DESCRIPTION OF THE FIGURES 

FIG. 1 is a functional block diagram of a speech communication system having a 
source encoder and source decoder. 

FIG. 2 is a more detailed functional block diagram of the speech communication 
system of FIG. 1. 

FIG. 3 is a functional block diagram of an exemplary first stage, a speech pre- 
processor, of the source encoder used by one embodiment of the speech communication 
system of FIG. 1. 

FIG. 4 is a functional block diagram illustrating an exemplary second stage of the 
source encoder used by one embodiment of the speech communication system of FIG. 1.

FIG. 5 is a functional block diagram illustrating an exemplary third stage of the 
source encoder used by one embodiment of the speech communication system of FIG. 1.

FIG. 6 is a functional block diagram illustrating an exemplary fourth stage of the 
source encoder used by one embodiment of the speech communication system of FIG. 1 for 
processing non-periodic speech (mode 0). 
FIG. 7 is a functional block diagram illustrating an exemplary fourth stage of the
source encoder used by one embodiment of the speech communication system of FIG. 1 for 
processing periodic speech (mode 1). 

FIG. 8 is a block diagram of one embodiment of a speech decoder for processing
coded information from a speech encoder built in accordance with the present invention.

FIG. 9 illustrates a hypothetical example of received frames and a lost frame. 

FIG. 10 illustrates a hypothetical example of received frames and a lost frame as well
as the minimum spacings between LSFs assigned to each frame in a prior art system and a
speech communication system built in accordance with the present invention.

FIG. 11 illustrates a hypothetical example showing how a prior art speech
communication system assigns and uses pitch lag and delta pitch lag information for each
frame.

FIG. 12 illustrates a hypothetical example showing how a speech communication
system built in accordance with the present invention assigns and uses pitch lag and delta
pitch lag information for each frame.

FIG. 13 illustrates a hypothetical example showing how a speech decoder built in
accordance with the present invention assigns adaptive gain parameter information for each 
frame when there is a lost frame. 

FIG. 14 illustrates a hypothetical example showing how a prior art encoder uses seeds
to generate a random excitation value for each frame containing silence or background noise.

FIG. 15 illustrates a hypothetical example showing how a prior art decoder uses seeds
to generate a random excitation value for each frame containing silence or background noise 
and loses synchronicity with the encoder if there is a lost frame. 

FIG. 16 is a flowchart showing an example of processing nonperiodic-like speech in
accordance with the present invention.

FIG. 17 is a flowchart showing an example of processing periodic-like speech in
accordance with the present invention. 



DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT 

First, the overall speech communication system is described in general terms, and then a
detailed description of an embodiment of the present invention is provided.

FIG. 1 is a schematic block diagram of a speech communication system illustrating
the general use of a speech encoder and decoder in a communication system. A speech
communication system 100 transmits and reproduces speech across a communication channel
103. Although it may comprise, for example, a wire, fiber, or optical link, the communication
channel 103 typically comprises, at least in part, a radio frequency link that often must
support multiple, simultaneous speech exchanges requiring shared bandwidth resources such
as may be found with cellular telephones.

A storage device may be coupled to the communication channel 103 to temporarily
store speech information for delayed reproduction or playback, e.g., to perform answering
machine functions, voice email, etc. Likewise, the communication channel 103 might be
replaced by such a storage device in a single device embodiment of the communication
system 100 that, for example, merely records and stores speech for subsequent playback.

In particular, a microphone 111 produces a speech signal in real time. The
microphone 111 delivers the speech signal to an A/D (analog to digital) converter 115. The
A/D converter 115 converts the analog speech signal into a digital form and then delivers the
digitized speech signal to a speech encoder 117.



The speech encoder 117 encodes the digitized speech by using a selected one of a
plurality of encoding modes. Each of the plurality of encoding modes uses particular
techniques that attempt to optimize the quality of the resultant reproduced speech. While
operating in any of the plurality of modes, the speech encoder 117 produces a series of
modeling and parameter information (e.g., "speech parameters") and delivers the speech
parameters to an optional channel encoder 119.

The optional channel encoder 119 coordinates with a channel decoder 131 to deliver
the speech parameters across the communication channel 103. The channel decoder 131
forwards the speech parameters to a speech decoder 133. While operating in a mode that
corresponds to that of the speech encoder 117, the speech decoder 133 attempts to recreate
the original speech from the speech parameters as accurately as possible. The speech
decoder 133 delivers the reproduced speech to a D/A (digital to analog) converter 135 so that
the reproduced speech may be heard through a speaker 137.

FIG. 2 is a functional block diagram illustrating an exemplary communication device
of FIG. 1. A communication device 151 comprises both a speech encoder and decoder for
simultaneous capture and reproduction of speech. Typically within a single housing, the
communication device 151 might, for example, comprise a cellular telephone, portable
telephone, computing system, or some other communication device. Alternatively, if a
memory element is provided for storing encoded speech information, the communication
device 151 might comprise an answering machine, a recorder, voice mail system, or other
communication memory device.

A microphone 155 and an A/D converter 157 deliver a digital voice signal to an
encoding system 159. The encoding system 159 performs speech encoding and delivers
resultant speech parameter information to the communication channel. The delivered speech
parameter information may be destined for another communication device (not shown) at a
remote location.

As speech parameter information is received, a decoding system 165 performs speech
decoding. The decoding system delivers the decoded speech to a D/A converter
167 so that the analog speech output may be played on a speaker 169. The end result is the
reproduction of sounds as similar as possible to the originally captured speech.

The encoding system 159 comprises both a speech processing circuit 185 that
performs speech encoding and an optional channel processing circuit 187 that performs the
optional channel encoding. Similarly, the decoding system 165 comprises a speech
processing circuit 189 that performs speech decoding and an optional channel processing
circuit 191 that performs channel decoding.

Although the speech processing circuit 185 and the optional channel processing
circuit 187 are separately illustrated, they may be combined in part or in total into a single
unit. For example, the speech processing circuit 185 and the channel processing circuitry
187 may share a single DSP (digital signal processor) and/or other processing circuitry.
Similarly, the speech processing circuit 189 and the optional channel processing circuit 191
may be entirely separate or combined in part or in whole. Moreover, combinations in whole
or in part may be applied to the speech processing circuits 185 and 189, the channel
processing circuits 187 and 191, the processing circuits 185, 187, 189 and 191, or otherwise
as appropriate. Further, each or all of the circuits which control aspects of the operation of
the decoder and/or encoder may be referred to as a control logic and may be implemented,
for example, by a microprocessor, microcontroller, CPU (central processing unit), ALU
(arithmetic logic unit), a co-processor, an ASIC (application specific integrated circuit), or
any other kind of circuit and/or software.

The encoding system 159 and the decoding system 165 both use a memory 161. The
speech processing circuit 185 uses a fixed codebook 181 and an adaptive codebook 183 of a
speech memory 177 during the source encoding process. Similarly, the speech processing
circuit 189 uses the fixed codebook 181 and the adaptive codebook 183 during the source
decoding process.

Although the speech memory 177 as illustrated is shared by the speech processing
circuits 185 and 189, one or more separate speech memories can be assigned to each of the
processing circuits 185 and 189. The memory 161 also contains software used by the
processing circuits 185, 187, 189 and 191 to perform various functions required in the source
encoding and decoding processes.

Before discussing the details of an embodiment of the improvement in speech coding,
an overview of the overall speech encoding algorithm is provided at this point. The
improved speech encoding algorithm referred to in this specification may be, for example,
the eX-CELP (extended CELP) algorithm, which is based on the CELP model. The details of
the eX-CELP algorithm are discussed in a U.S. patent application assigned to the same
assignee, Conexant Systems, Inc., and previously incorporated herein by reference:
Provisional U.S. Patent Application Serial No. 60/155,321 titled "4 kbits/s Speech Coding,"
Conexant Docket No. 99RSS485, filed September 22, 1999.

In order to achieve toll quality at a low bit rate (such as 4 kilobits per second), the
improved speech encoding algorithm departs somewhat from the strict waveform-matching
criterion of traditional CELP algorithms and strives to capture the perceptually important
features of the input signal. To do so, the improved speech encoding algorithm analyzes the
input signal according to certain features such as degree of noise-like content, degree of
spiky-like content, degree of voiced content, degree of unvoiced content, evolution of
magnitude spectrum, evolution of energy contour, evolution of periodicity, etc., and uses this
information to control weighting during the encoding and quantization process. The
philosophy is to accurately represent the perceptually important features and allow relatively
larger errors in less important features. As a result, the improved speech encoding algorithm
focuses on perceptual matching instead of waveform matching. The focus on perceptual
matching results in satisfactory speech reproduction because of the assumption that at 4 kbits
per second, waveform matching is not sufficiently accurate to faithfully capture all
information in the input signal. Consequently, the improved speech encoder performs some
prioritizing to achieve improved results.

In one particular embodiment, the improved speech encoder uses a frame size of 20
milliseconds, or 160 samples per frame (corresponding to an 8 kHz sampling rate), each
frame being divided into either two or three subframes. The number of subframes depends on
the mode of subframe processing. In this particular embodiment, one of two modes may be
selected for each frame of speech: Mode 0 and Mode 1. Importantly, the manner in which
subframes are processed depends on the mode. In this particular embodiment, Mode 0 uses two
subframes per frame where each subframe is 10 milliseconds in duration, or 80 samples.
Likewise, in this example embodiment, Mode 1 uses three subframes per frame where the first
and second subframes are 6.625 milliseconds in duration, or 53 samples each, and the third
subframe is 6.75 milliseconds in duration, or 54 samples. In both modes, a look-ahead of 15
milliseconds may be used. For both Modes 0 and 1, a tenth order Linear Prediction (LP)
model may be used to represent the spectral envelope of the signal. The LP model may be
coded in the Line Spectrum Frequency (LSF) domain by using, for example, a delayed-decision,
switched multi-stage predictive vector quantization scheme.
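
The frame and subframe sizes above are consistent with an 8 kHz sampling rate (160 samples in 20 ms). The short check below converts each duration to samples and confirms that both modes tile the 160-sample frame exactly; the constant and function names are illustrative, not taken from the codec.

```python
SAMPLE_RATE_HZ = 8000  # implied by 160 samples per 20 ms frame

def ms_to_samples(ms, rate=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(ms * rate / 1000)

FRAME_SAMPLES = ms_to_samples(20)
SUBFRAME_SAMPLES = {
    0: [ms_to_samples(10), ms_to_samples(10)],
    1: [ms_to_samples(6.625), ms_to_samples(6.625), ms_to_samples(6.75)],
}
LOOKAHEAD_SAMPLES = ms_to_samples(15)

for mode, sizes in SUBFRAME_SAMPLES.items():
    # Each mode must tile the frame exactly.
    assert sum(sizes) == FRAME_SAMPLES, (mode, sizes)
```

The 15 ms look-ahead works out to 120 samples at the same rate.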

Mode 0 operates a traditional speech encoding algorithm such as a CELP algorithm. 
However, Mode 0 is not used for all frames of speech. Instead, Mode 0 is selected to handle 
frames of all speech other than "periodic-like" speech, as discussed in greater detail below. 
For convenience, "periodic-like" speech is referred to here as periodic speech, and all other
speech is "non-periodic" speech. Such "non-periodic" speech includes transition frames,
where typical parameters such as pitch correlation and pitch lag change rapidly, and
frames whose signal is dominantly noise-like. Mode 0 breaks each frame into two
subframes. Mode 0 codes the pitch lag once per subframe and has a two-dimensional vector
quantizer to jointly code the pitch gain (i.e., adaptive codebook gain) and the fixed codebook
gain once per subframe. In this example embodiment, the fixed codebook contains two pulse
sub-codebooks and one Gaussian sub-codebook; the two pulse sub-codebooks have two and
three pulses, respectively.

Mode 1 deviates from the traditional CELP algorithm. Mode 1 handles frames
containing periodic speech, which typically have high periodicity and are often well
represented by a smooth pitch track. In this particular embodiment, Mode 1 uses three
subframes per frame. The pitch lag is coded once per frame prior to the subframe processing
as part of the pitch pre-processing, and the interpolated pitch track is derived from this lag.
The three pitch gains of the subframes exhibit very stable behavior and are jointly quantized
using pre-vector quantization based on a mean-squared error criterion prior to the closed
loop subframe processing. The three unquantized reference pitch gains are derived
from the weighted speech and are a byproduct of the frame-based pitch pre-processing.
Using the pre-quantized pitch gains, the traditional CELP subframe processing is performed,
except that the three fixed codebook gains are left unquantized. The three fixed codebook
gains are jointly quantized after subframe processing, which is based on a delayed decision
approach using a moving average prediction of the energy. The three subframes are
subsequently synthesized with fully quantized parameters.

The manner in which the mode of processing is selected for each frame of speech 
based on the classification of the speech contained in the frame and the innovative way in 
which periodic speech is processed allow for gain quantization with significantly fewer bits
without any significant sacrifice in the perceptual quality of the speech. Details of this 
manner of processing speech are provided below. 

FIGs. 3-7 are functional block diagrams illustrating a multi-stage encoding approach 
used by one embodiment of the speech encoder illustrated in FIGs. 1 and 2. In particular, 
FIG. 3 is a functional block diagram illustrating a speech pre-processor 193 that comprises 
the first stage of the multi-stage encoding approach; FIG. 4 is a functional block diagram 
illustrating the second stage; FIGs. 5 and 6 are functional block diagrams depicting Mode 0
of the third stage; and FIG. 7 is a functional block diagram depicting Mode 1 of the third
stage. The speech encoder, which comprises encoder processing circuitry, typically operates 
under software instruction to carry out the following functions. 

Input speech is read and buffered into frames. Turning to the speech pre-processor
193 of FIG. 3, a frame of input speech 192 is provided to a silence enhancer 195 that
determines whether the frame of speech is pure silence, i.e., only "silence noise" is present.
The silence enhancer 195 adaptively detects on a frame basis whether the current frame is
purely "silence noise." If the signal 192 is "silence noise," the silence enhancer 195 ramps
the signal to the zero-level of the signal 192. Otherwise, if the signal 192 is not "silence
noise," the silence enhancer 195 does not modify the signal 192. The silence enhancer 195
cleans up very low level noise in the silence portions of clean speech and thus enhances
the perceptual quality of the clean speech. The effect of the silence enhancement function
becomes especially noticeable when the input speech originates from an A-law source; that is,
the input has passed through A-law encoding and decoding immediately prior to processing
by the present speech coding algorithm. Because A-law amplifies sample values around 0
(e.g., -1, 0, +1) to either -8 or +8, the amplification in A-law could transform an inaudible
silence noise into a clearly audible noise. After processing by the silence enhancer 195, the
speech signal is provided to a high-pass filter 197.

The high-pass filter 197 eliminates frequencies below a certain cutoff frequency and
permits frequencies higher than the cutoff frequency to pass to a noise attenuator 199. In this
particular embodiment, the high-pass filter 197 is identical to the input high-pass filter of the 
G.729 speech coding standard of ITU-T. Namely, it is a second order pole-zero filter with a 
cut-off frequency of 140 hertz (Hz). Of course, the high-pass filter 197 need not be such a 
filter and may be constructed to be any kind of appropriate filter known to those of ordinary 
skill in the art. 
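As an illustration, a second-order pole-zero high-pass filter of this kind can be run as a direct-form biquad. The coefficients below are the ones we understand ITU-T G.729 to specify for its 140 Hz pre-processing filter at 8 kHz (that filter also scales the input by 1/2); treat them as an assumption to be checked against the G.729 text. The function name `high_pass` is our own.

```python
import math

# Assumed G.729 pre-processing filter (140 Hz cutoff, 8 kHz sampling,
# includes a 1/2 input scaling). Verify against the ITU-T G.729 text.
B = (0.46363718, -0.92724705, 0.46363718)  # numerator (zeros)
A = (1.0, -1.9059465, 0.9114024)           # denominator (poles), A[0] = 1


def high_pass(x):
    """Filter a sequence of samples with the biquad; returns a new list."""
    x1 = x2 = y1 = y2 = 0.0  # filter memory: previous inputs and outputs
    out = []
    for xn in x:
        yn = B[0] * xn + B[1] * x1 + B[2] * x2 - A[1] * y1 - A[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out
```

With these coefficients a DC offset of 1000 settles to roughly 5, while a 1 kHz tone passes at roughly half its amplitude because of the built-in 1/2 scaling.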
The noise attenuator 199 performs a noise suppression algorithm. In this particular
embodiment, the noise attenuator 199 performs a weak noise attenuation of a maximum of 5
decibels (dB) of the environmental noise in order to improve the estimation of the parameters
by the speech encoding algorithm. The specific methods of enhancing silence, building a
high-pass filter 197 and attenuating noise may use any one of the numerous techniques
known to those of ordinary skill in the art. The output of the speech pre-processor 193 is
pre-processed speech 200.

Of course, the silence enhancer 195, high-pass filter 197 and noise attenuator 199
may be replaced by any other device or modified in a manner known to those of ordinary
skill in the art and appropriate for the particular application.

Turning to FIG. 4, a functional block diagram of the common frame-based processing 
of a speech signal is provided. In other words, FIG. 4 illustrates the processing of a speech 
signal on a frame-by-frame basis. This frame processing occurs regardless of the mode (e.g., 
Modes 0 or 1) before the mode-dependent processing 250 is performed. The pre-processed
speech 200 is received by a perceptual weighting filter 252 that operates to emphasize the
valley areas and de-emphasize the peak areas of the pre-processed speech signal 200. The
perceptual weighting filter 252 may be replaced by any other device or modified in a manner 
known to those of ordinary skill in the art and appropriate for the particular application. 

An LPC analyzer 260 receives the pre-processed speech signal 200 and estimates the
short-term spectral envelope of the speech signal 200. The LPC analyzer 260 extracts LPC
coefficients from the characteristics defining the speech signal 200. In one embodiment,
three tenth-order LPC analyses are performed for each frame. They are centered at the
middle third, the last third and the lookahead of the frame. The LPC analysis for the
lookahead is recycled for the next frame as the LPC analysis centered at the first third of that
frame. Thus, for each frame, four sets of LPC parameters are generated. The LPC analyzer
260 may also perform quantization of the LPC coefficients into, for example, a line spectral
frequency (LSF) domain. The quantization of the LPC coefficients may be either scalar or
vector quantization and may be performed in any appropriate domain in any manner known
in the art.
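The text does not say how each tenth-order LPC analysis is computed. A common choice is the autocorrelation method followed by the Levinson-Durbin recursion; the following is a minimal sketch of one analysis under that assumption. It expects an already-windowed frame, omits lag windowing and bandwidth expansion, and the helper names are our own.

```python
def autocorrelation(x, order):
    """Autocorrelation r[0..order] of an already-windowed frame."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] such that
    s(n) is approximated by sum_i a[i] * s(n - i).
    Returns (coefficients, final prediction error energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused; a[i] multiplies s(n - i)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep the coefficients found so far
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # i-th reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err
```

Typical use would be `coeffs, err = levinson_durbin(autocorrelation(windowed, 10), 10)`. For an autocorrelation of the form r(k) = 0.5^k (a first-order process), the recursion returns 0.5 for the first coefficient and zeros for the rest.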

A classifier 270 obtains information about the characteristics of the pre-processed
speech 200 by looking at, for example, the absolute maximum of the frame, reflection
coefficients, prediction error, LSF vector from the LPC analyzer 260, the tenth order 
autocorrelation, recent pitch lag and recent pitch gains. These parameters are known to those 
of ordinary skill in the art and for that reason, are not further explained here. The classifier 
270 uses the information to control other aspects of the encoder such as the estimation of 
signal-to-noise ratio, pitch estimation, classification, spectral smoothing, energy smoothing 
and gain normalization. Again, these aspects are known to those of ordinary skill in the art 
and for that reason, are not further explained here. A brief summary of the classification 
algorithm is provided next. 

The classifier 270, with help from the pitch preprocessor 254, classifies each frame 
into one of six classes according to the dominating feature of the frame. The classes are (1)
Silence/Background Noise; (2) Noise-Like Unvoiced Speech; (3) Unvoiced; (4) Transition
(includes onset); (5) Non-Stationary Voiced; and (6) Stationary Voiced. The classifier 270
may use any approach to classify the input signal into periodic signals and non-periodic 
signals. For example, the classifier 270 may take the pre-processed speech signal, the pitch 
lag and correlation of the second half of the frame, and other information as input 
parameters. 

Various criteria can be used to determine whether speech is deemed to be periodic. 
For example, speech may be considered periodic if the speech is a stationary voiced signal. 
Some people may consider periodic speech to include stationary voiced speech and non- 
stationary voiced speech, but for purposes of this specification, periodic speech includes 
stationary voiced speech. Furthermore, periodic speech may be smooth and stationary 
speech. Voiced speech is considered to be "stationary" when the speech signal does not
change more than a certain amount within a frame. Such a speech signal is more likely to
have a well-defined energy contour. A speech signal is "smooth" if the adaptive codebook
gain Gp of that speech is greater than a threshold value. For example, if the threshold value
is 0.7, a speech signal in a subframe is considered to be smooth if its adaptive codebook gain
Gp is greater than 0.7. Non-periodic speech, or non-voiced speech, includes unvoiced speech
(e.g., fricatives such as the "shhh" sound), transitions (e.g., onsets, offsets), background noise
and silence.

More specifically, in the example embodiment, the speech encoder initially derives
the following parameters:

Spectral Tilt (estimation of the first reflection coefficient 4 times per frame):

$$\kappa(k) = \frac{\sum_{n=1}^{L-1} s_k(n)\, s_k(n-1)}{\sum_{n=0}^{L-1} s_k(n)^2}, \qquad k = 0, 1, \ldots, 3, \qquad (1)$$

where $L = 80$ is the window over which the reflection coefficient is calculated and $s_k(n)$ is the
$k$-th segment given by

$$s_k(n) = s(k \cdot 40 - 20 + n) \cdot w_h(n), \qquad n = 0, 1, \ldots, 79, \qquad (2)$$

where $w_h(n)$ is an 80-sample Hamming window and $s(0), s(1), \ldots, s(159)$ is the current frame
of the pre-processed speech signal.
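Equations (1) and (2) reach 20 samples before the frame (k = 0) and 20 samples past it (k = 3), so a direct implementation needs a 200-sample buffer. The sketch below assumes that buffer layout (20 past samples, the 160-sample frame, 20 look-ahead samples); the function names and layout are our own choice, not taken from the specification.

```python
import math

FRAME = 160  # samples s(0)..s(159) of the current frame
SEG = 80     # L, length of each windowed segment


def hamming(n_len):
    """Symmetric Hamming window of n_len samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1)) for n in range(n_len)]


def spectral_tilt(buf):
    """buf holds s(-20)..s(179), i.e. buf[i] == s(i - 20).
    Returns [kappa(0), ..., kappa(3)] per Equations (1)-(2)."""
    assert len(buf) == FRAME + 40
    w = hamming(SEG)
    tilts = []
    for k in range(4):
        start = k * 40  # s(k*40 - 20) is stored at buf[k*40]
        s_k = [buf[start + n] * w[n] for n in range(SEG)]
        num = sum(s_k[n] * s_k[n - 1] for n in range(1, SEG))
        den = sum(v * v for v in s_k)
        tilts.append(num / den if den > 0 else 0.0)
    return tilts
```

A slowly varying (low-pass) signal gives a tilt near +1, while a signal that alternates sign every sample gives a tilt near -1.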
Absolute Maximum (tracking of the absolute signal maximum, 8 estimates per frame):

$$\chi(k) = \max\{|s(n)|,\ n = n_s(k), n_s(k)+1, \ldots, n_e(k)-1\}, \qquad k = 0, 1, \ldots, 7, \qquad (3)$$

where $n_s(k)$ and $n_e(k)$ are the starting point and end point, respectively, for the search of the $k$-th
maximum at time $k \cdot 160/8$ samples of the frame. In general, the length of the segment is 1.5
times the pitch period and the segments overlap. Thus, a smooth contour of the amplitude
envelope can be obtained.
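The text fixes the segment length (1.5 pitch periods) but not where each segment sits. The sketch below assumes, purely for illustration, that the k-th segment ends at sample (k + 1) * 20 and extends backward into past samples when needed; `absolute_maxima` and this placement are hypothetical.

```python
def absolute_maxima(history, frame, pitch_lag):
    """Eight absolute-maximum estimates chi(0..7) for a 160-sample frame.
    Assumption (not from the source): the k-th search segment ends at
    sample (k + 1) * 20 (exclusive) and spans round(1.5 * pitch_lag)
    samples, reaching back into `history` when needed."""
    buf = list(history) + list(frame)
    offset = len(history)  # buf[offset + n] == s(n)
    seg_len = max(1, round(1.5 * pitch_lag))
    chi = []
    for k in range(8):
        n_e = (k + 1) * 20                  # end point, exclusive as in Eq. (3)
        n_s = max(n_e - seg_len, -offset)   # start point, clipped to the buffer
        chi.append(max(abs(buf[offset + n]) for n in range(n_s, n_e)))
    return chi
```

Because consecutive segments are 20 samples apart but span 1.5 pitch periods, a single peak shows up in several overlapping estimates, which is what smooths the amplitude contour.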

The Spectral Tilt, Absolute Maximum, and Pitch Correlation parameters form the 
basis for the classification. However, additional processing and analysis of the parameters 
are performed prior to the classification decision. The parameter processing initially applies 
weighting to the three parameters. The weighting in some sense removes the background 
noise component in the parameters by subtracting the contribution from the background 
noise. This provides a parameter space that is "independent" from any background noise and 
thus is more uniform and improves the robustness of the classification to background noise. 

Running means of the pitch period energy of the noise, the spectral tilt of the noise,
the absolute maximum of the noise, and the pitch correlation of the noise are updated eight
times per frame according to the following Equations 4-7, which provides a fine time
resolution of the parameter space:

Running mean of the pitch period energy of the noise:

$$\langle E_{N,p}(k) \rangle = \alpha_1 \cdot \langle E_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot E_p(k), \qquad (4)$$

where $E_p(k)$ is the normalized energy of the pitch period at time $k \cdot 160/8$ samples of the
frame. The segments over which the energy is calculated may overlap since the pitch period
typically exceeds 20 samples (160 samples/8).
Running mean of the spectral tilt of the noise:

$$\langle \kappa_N(k) \rangle = \alpha_1 \cdot \langle \kappa_N(k-1) \rangle + (1 - \alpha_1) \cdot \kappa(k \bmod 2). \qquad (5)$$

Running mean of the absolute maximum of the noise:

$$\langle \chi_N(k) \rangle = \alpha_1 \cdot \langle \chi_N(k-1) \rangle + (1 - \alpha_1) \cdot \chi(k). \qquad (6)$$

Running mean of the pitch correlation of the noise:

$$\langle R_{N,p}(k) \rangle = \alpha_1 \cdot \langle R_{N,p}(k-1) \rangle + (1 - \alpha_1) \cdot R_p, \qquad (7)$$

where $R_p$ is the input pitch correlation for the second half of the frame. The adaptation
constant $\alpha_1$ is adaptive, though its typical value is $\alpha_1 = 0.99$.
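All four running means in Equations (4)-(7) share the same first-order recursive form. The sketch below bundles them; it fixes the adaptation constant at the typical value 0.99 because the text does not give the adaptation rule for it, starts from zero, and updates unconditionally (any gating on noise-only segments is not described here). `NoiseStats` is a hypothetical name.

```python
from dataclasses import dataclass


def running_mean(prev_mean, new_value, alpha=0.99):
    """<x(k)> = alpha * <x(k-1)> + (1 - alpha) * x(k), as in Equations (4)-(7)."""
    return alpha * prev_mean + (1.0 - alpha) * new_value


@dataclass
class NoiseStats:
    """Running means of the noise parameters, updated 8 times per frame."""
    energy: float = 0.0      # <E_{N,p}(k)>, Eq. (4)
    tilt: float = 0.0        # <kappa_N(k)>, Eq. (5)
    maximum: float = 0.0     # <chi_N(k)>, Eq. (6)
    pitch_corr: float = 0.0  # <R_{N,p}(k)>, Eq. (7)

    def update(self, e_p, kappa, chi, r_p, alpha=0.99):
        # e_p = E_p(k), kappa = kappa(k mod 2), chi = chi(k), r_p = R_p
        self.energy = running_mean(self.energy, e_p, alpha)
        self.tilt = running_mean(self.tilt, kappa, alpha)
        self.maximum = running_mean(self.maximum, chi, alpha)
        self.pitch_corr = running_mean(self.pitch_corr, r_p, alpha)
```

With alpha = 0.99 and a constant input x starting from zero, the mean after n updates is x * (1 - 0.99**n), so the estimate tracks over roughly a hundred updates (about 12 frames at eight updates per frame).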
The background noise to signal ratio is calculated according to

$$\gamma(k) = \sqrt{\frac{\langle E_{N,p}(k) \rangle}{E_p(k)}}. \qquad (8)$$

The parametric noise attenuation is limited to 30 dB, i.e.,

$$\gamma(k) = \begin{cases} 0.968, & \gamma(k) > 0.968, \\ \gamma(k), & \text{otherwise}. \end{cases} \qquad (9)$$

The noise-free set of parameters (weighted parameters) is obtained by removing the noise
component according to the following Equations 10-12:

Estimation of weighted spectral tilt:

$$\kappa_w(k) = \kappa(k \bmod 2) - \gamma(k) \cdot \langle \kappa_N(k) \rangle. \qquad (10)$$

Estimation of weighted absolute maximum:

$$\chi_w(k) = \chi(k) - \gamma(k) \cdot \langle \chi_N(k) \rangle. \qquad (11)$$

Estimation of weighted pitch correlation:

$$R_{w,p}(k) = R_p - \gamma(k) \cdot \langle R_{N,p}(k) \rangle. \qquad (12)$$
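Equations (9)-(12) amount to clamping the noise-to-signal ratio and then subtracting the scaled noise means. In the sketch below, gamma is passed in already computed, and the noise means are passed as a dict with hypothetical keys `tilt`, `max` and `corr`. One way to read the "30 dB" limit: with the clamp at 0.968, a parameter equal to its noise mean is reduced to 1 - 0.968 = 0.032 of its value, about -29.9 dB.

```python
import math

GAMMA_MAX = 0.968  # Eq. (9): caps the parametric noise attenuation near 30 dB


def weighted_parameters(kappa, chi, r_p, gamma, noise):
    """Equations (9)-(12). kappa is kappa(k mod 2), chi is chi(k), r_p is R_p,
    gamma is the noise-to-signal ratio for time k, and `noise` maps the
    hypothetical keys 'tilt', 'max', 'corr' to the running noise means."""
    g = min(gamma, GAMMA_MAX)  # Eq. (9)
    kappa_w = kappa - g * noise["tilt"]  # Eq. (10)
    chi_w = chi - g * noise["max"]       # Eq. (11)
    r_wp = r_p - g * noise["corr"]       # Eq. (12)
    return kappa_w, chi_w, r_wp
```

For example, `weighted_parameters(0.5, 100.0, 0.8, 2.0, {"tilt": 0.1, "max": 10.0, "corr": 0.2})` clamps gamma to 0.968 and returns approximately (0.4032, 90.32, 0.6064).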

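The evolution of the weighted tilt and the weighted maximum is calculated according to the
following Equations 13 and 14, respectively, as the slope of the first-order approximation:

$$\partial\kappa_w(k) = \frac{\sum_{l=1}^{7} l \cdot \left(\kappa_w(k-7+l) - \kappa_w(k-7)\right)}{\sum_{l=1}^{7} l^2}, \qquad (13)$$

$$\partial\chi_w(k) = \frac{\sum_{l=1}^{7} l \cdot \left(\chi_w(k-7+l) - \chi_w(k-7)\right)}{\sum_{l=1}^{7} l^2}. \qquad (14)$$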
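Equations (13) and (14) are the least-squares slope of a straight line pinned at the oldest of the last eight values, x(k-7). A direct transcription, with a hypothetical function name:

```python
def slope(values):
    """Slope per Equations (13)-(14). values[0..7] hold x(k-7)..x(k):
    sum_{l=1}^{7} l * (x(k-7+l) - x(k-7)) / sum_{l=1}^{7} l^2."""
    assert len(values) == 8
    num = sum(l * (values[l] - values[0]) for l in range(1, 8))
    den = sum(l * l for l in range(1, 8))  # = 140
    return num / den
```

For an exactly linear sequence such as 3, 5, 7, ..., 17 the function returns the true slope, 2.0; a constant sequence returns 0.0.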

Once the parameters of Equations 4 through 14 are updated for the eight sample points of the
frame, the following frame-based parameters are calculated from the parameters of Equations
4-14:

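Maximum weighted pitch correlation:

$$R_{w,p}^{\max} = \max\{R_{w,p}(k-7+l),\ l = 0, 1, \ldots, 7\}. \qquad (15)$$

Average weighted pitch correlation:

$$R_{w,p}^{\mathrm{avg}} = \frac{1}{8} \sum_{l=0}^{7} R_{w,p}(k-7+l). \qquad (16)$$

Running mean of average weighted pitch correlation:

$$\langle R_{w,p}^{\mathrm{avg}}(m) \rangle = \alpha_2 \cdot \langle R_{w,p}^{\mathrm{avg}}(m-1) \rangle + (1 - \alpha_2) \cdot R_{w,p}^{\mathrm{avg}}, \qquad (17)$$

where $m$ is the frame number and $\alpha_2 = 0.75$ is the adaptation constant.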

Normalized standard deviation of pitch lag:

$$\sigma_{L_p}(m) = \frac{1}{\mu_{L_p}(m)} \sqrt{\frac{\sum_{l=0}^{2} \left(L_p(m-2+l) - \mu_{L_p}(m)\right)^2}{3}}, \qquad (18)$$

where $L_p(m)$ is the input pitch lag and $\mu_{L_p}(m)$ is the mean of the pitch lag over the past three
frames given by

$$\mu_{L_p}(m) = \frac{1}{3} \sum_{l=0}^{2} L_p(m-2+l). \qquad (19)$$



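Minimum weighted spectral tilt:

$$\kappa_w^{\min} = \min\{\kappa_w(k-7+l),\ l = 0, 1, \ldots, 7\}. \qquad (20)$$

Running mean of minimum weighted spectral tilt:

$$\langle \kappa_w^{\min}(m) \rangle = \alpha_2 \cdot \langle \kappa_w^{\min}(m-1) \rangle + (1 - \alpha_2) \cdot \kappa_w^{\min}. \qquad (21)$$
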
Average weighted spectral tilt:

$$\kappa_w^{\mathrm{avg}} = \frac{1}{8} \sum_{l=0}^{7} \kappa_w(k-7+l). \qquad (22)$$

Minimum slope of weighted tilt:

$$\partial\kappa_w^{\min} = \min\{\partial\kappa_w(k-7+l),\ l = 0, 1, \ldots, 7\}. \qquad (23)$$

Accumulated slope of weighted spectral tilt:

$$\partial\kappa_w^{\mathrm{acc}} = \sum_{l=0}^{7} \partial\kappa_w(k-7+l). \qquad (24)$$

Maximum slope of weighted maximum:

$$\partial\chi_w^{\max} = \max\{\partial\chi_w(k-7+l),\ l = 0, 1, \ldots, 7\}. \qquad (25)$$

Accumulated slope of weighted maximum:

$$\partial\chi_w^{\mathrm{acc}} = \sum_{l=0}^{7} \partial\chi_w(k-7+l). \qquad (26)$$
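The frame-based features reduce to max, min, mean and sum over the last eight sub-frame values, plus the pitch-lag statistics over three frames. A compact sketch follows; the dictionary keys are hypothetical labels, and the average weighted spectral tilt (22) is taken as a plain eight-point mean by analogy with Equation (16).

```python
import math


def pitch_lag_stats(lags):
    """Equations (18)-(19): lags = [L_p(m-2), L_p(m-1), L_p(m)].
    Returns (mean, normalized standard deviation)."""
    mu = sum(lags) / 3.0
    var = sum((lag - mu) ** 2 for lag in lags) / 3.0
    return mu, math.sqrt(var) / mu


def frame_features(r_wp, kappa_w, d_kappa_w, d_chi_w):
    """Frame-based features from the last eight values of each weighted
    parameter and slope (index l = 0..7 corresponds to time k-7+l)."""
    return {
        "R_max": max(r_wp),               # Eq. (15)
        "R_avg": sum(r_wp) / 8.0,         # Eq. (16)
        "kappa_min": min(kappa_w),        # Eq. (20)
        "kappa_avg": sum(kappa_w) / 8.0,  # Eq. (22)
        "d_kappa_min": min(d_kappa_w),    # Eq. (23)
        "d_kappa_acc": sum(d_kappa_w),    # Eq. (24)
        "d_chi_max": max(d_chi_w),        # Eq. (25)
        "d_chi_acc": sum(d_chi_w),        # Eq. (26)
    }
```

The running means (17) and (21) would apply the same recursive update as Equations (4)-(7), once per frame with alpha = 0.75. For lags 38, 40 and 42, the mean is 40 and the normalized deviation is sqrt(8/3)/40, about 0.041.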

The parameters given by Equations 23, 25, and 26 are used to mark whether a frame
is likely to contain an onset, and the parameters given by Equations 16-18 and 20-22 are used to
mark whether a frame is likely to be dominated by voiced speech. Based on the initial
marks, past marks and other information, the frame is classified into one of the six classes.

A more detailed description of the manner in which the classifier 270 classifies the
pre-processed speech 200 is provided in a U.S. patent application assigned to the same
assignee, Conexant Systems, Inc., and previously incorporated herein by reference:
Provisional U.S. Patent Application Serial No. 60/155,321 titled "4 kbits/s Speech Coding,"
Conexant Docket No. 99RSS485, filed September 22, 1999.

The LSF quantizer 267 receives the LPC coefficients from the LPC analyzer 260 and 
quantizes the LPC coefficients. The purpose of LSF quantization, which may be any known 
method of quantization including scalar or vector quantization, is to represent the coefficients 
with fewer bits. In this particular embodiment, LSF quantizer 267 quantizes the tenth-order
LPC model. The LSF quantizer 267 may also smooth out the LSFs in order to reduce
undesired fluctuations in the spectral envelope of the LPC synthesis filter. The LSF
quantizer 267 sends the quantized coefficients $A_q(z)$ 268 to the subframe processing portion
250 of the speech encoder. The subframe processing portion of the speech encoder is mode
dependent. Though LSF is preferred, the quantizer 267 can quantize the LPC coefficients
into a domain other than the LSF domain.

If pitch pre-processing is selected, the weighted speech signal 256 is sent to the pitch 
preprocessor 254. The pitch preprocessor 254 cooperates with the open loop pitch estimator 
272 in order to modify the weighted speech 256 so that its pitch information can be more 
accurately quantized. The pitch preprocessor 254 may, for example, use known compression
or dilation techniques on pitch cycles in order to improve the speech encoder's ability to
quantize the pitch gains. In other words, the pitch preprocessor 254 modifies the weighted
speech signal 256 in order to better match the estimated pitch track and thus more accurately
fit the coding model while producing perceptually indistinguishable reproduced speech. If
the encoder processing circuitry selects a pitch pre-processing mode, the pitch preprocessor 
254 performs pitch pre-processing of the weighted speech signal 256. The pitch preprocessor 
254 warps the weighted speech signal 256 to match interpolated pitch values that will be 
generated by the decoder processing circuitry. When pitch pre-processing is applied, the 
warped speech signal is referred to as a modified weighted speech signal 258. If pitch pre- 
processing mode is not selected, the weighted speech signal 256 passes through the pitch pre- 
processor 254 without pitch pre-processing (and for convenience, is still referred to as the 
"modified weighted speech signal" 258). The pitch preprocessor 254 may include a 
waveform interpolator whose function and implementation are known to those of ordinary 
skill in the art. The waveform interpolator may modify certain irregular transition segments 
using known forward-backward waveform interpolation techniques in order to enhance the 
regularities and suppress the irregularities of the speech signal. The pitch gain and pitch 
correlation for the weighted signal 256 are estimated by the pitch preprocessor 254. The 
open loop pitch estimator 272 extracts information about the pitch characteristics from the 
weighted speech 256. The pitch information includes pitch lag and pitch gain information. 

The pitch preprocessor 254 also interacts with the classifier 270 through the open- 
loop pitch estimator 272 to refine the classification by the classifier 270 of the speech signal. 
Because the pitch preprocessor 254 obtains additional information about the speech signal, 
the additional information can be used by the classifier 270 in order to fine tune its 
classification of the speech signal. After performing pitch pre-processing, the pitch 
preprocessor 254 outputs pitch track information 284 and unquantized pitch gains 286 to the 
mode-dependent subframe processing portion 250 of the speech encoder. 
Once the classifier 270 classifies the pre-processed speech 200 into one of a plurality
of possible classes, the classification number of the pre-processed speech signal 200 is sent to
the mode selector 274 and to the mode-dependent subframe processor 250 as control
information 280. The mode selector 274 uses the classification number to select the mode of
operation. In this particular embodiment, the classifier 270 classifies the pre-processed
speech signal 200 into one of six possible classes. If the pre-processed speech signal 200 is
stationary voiced speech (referred to here as "periodic" speech), the mode selector 274 sets
mode 282 to Mode 1. Otherwise, mode selector 274 sets mode 282 to Mode 0. The mode
signal 282 is sent to the mode-dependent subframe processing portion 250 of the speech
encoder. The mode information 282 is added to the bitstream that is transmitted to the
decoder.

The labeling of the speech as "periodic" and "non-periodic" should be interpreted
with some care in this particular embodiment. For example, the frames encoded using Mode
1 are those maintaining a high pitch correlation and high pitch gain throughout the frame
based on the pitch track 284 derived from only seven bits per frame. Consequently, the
selection of Mode 0 rather than Mode 1 could be due to an inaccurate representation of the
pitch track 284 with only seven bits and not necessarily due to the absence of periodicity.
Hence, signals encoded using Mode 0 may very well contain periodicity, though not well
represented by only seven bits per frame for the pitch track. Therefore, Mode 0 encodes
the pitch track with seven bits twice per frame, for a total of fourteen bits per frame, in order
to represent the pitch track more accurately.
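The mode decision and the resulting pitch-track bit budget described above fit in a few lines. The class numbering follows the six classes listed earlier; the names `FrameClass`, `PITCH_BITS` and `select_mode` are hypothetical.

```python
from enum import IntEnum


class FrameClass(IntEnum):
    """The six frame classes, numbered as in the text."""
    SILENCE_BACKGROUND_NOISE = 1
    NOISE_LIKE_UNVOICED = 2
    UNVOICED = 3
    TRANSITION = 4
    NON_STATIONARY_VOICED = 5
    STATIONARY_VOICED = 6


# Pitch-track bits per frame: Mode 1 codes one 7-bit lag per frame,
# Mode 0 codes a 7-bit lag twice per frame.
PITCH_BITS = {0: 2 * 7, 1: 7}


def select_mode(frame_class):
    """Mode 1 for stationary voiced ("periodic") frames, Mode 0 otherwise."""
    return 1 if frame_class == FrameClass.STATIONARY_VOICED else 0
```

Note that a non-stationary voiced frame still goes to Mode 0, which is why Mode 0 keeps the finer 14-bit pitch description.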

Each of the functional blocks in FIGs. 3-4, and the other FIGs. in this specification,
need not be discrete structures and may be combined with one or more other functional
blocks as desired.

The mode-dependent subframe processing portion 250 of the speech encoder operates
in two modes, Mode 0 and Mode 1. FIGs. 5-6 provide functional block diagrams of the
Mode 0 subframe processing, while FIG. 7 illustrates the functional block diagram of the
Mode 1 subframe processing of the third stage of the speech encoder. FIG. 8 illustrates a
block diagram of a speech decoder that corresponds with the improved speech encoder. The
speech decoder performs inverse mapping of the bit-stream to the algorithm parameters
followed by a mode-dependent synthesis. A more detailed description of these figures and
modes is provided in a U.S. patent application assigned to the same assignee, Conexant
Systems, Inc., which was previously incorporated herein by reference in its entirety: U.S.
Patent Application Serial No. 09/574,396 titled "A NEW SPEECH GAIN QUANTIZATION
STRATEGY," Conexant Docket No. 99RSS312, filed May 19, 2000.

The quantized parameters representing the speech signal may be packetized and then
transmitted in packets of data from the encoder to the decoder. In the example embodiment
described next, the speech signal is analyzed frame by frame, where each frame may have at
least one subframe, and each packet of data contains information for one frame. Thus, in this
example, the parameter information for each frame is transmitted in a packet of information.
In other words, there is one packet for each frame. Of course, other variations are possible
and, depending on the embodiment, each packet could represent a portion of a frame, more
than a frame of speech, or a plurality of frames.
LSF 

An LSF (line spectral frequency) is a representation of the LPC spectrum (i.e., the
short-term envelope of the speech spectrum). LSFs can be regarded as particular frequencies
at which the speech spectrum is sampled. If, for example, the system uses a 10th-order LPC,
there would be 10 LSFs per frame. There must be a minimum spacing between consecutive
LSFs so that they do not create quasi-unstable filters. For example, if $f_i$ is the $i$-th LSF and
equals 100 Hz, the $(i+1)$-th LSF, $f_{i+1}$, must be at least $f_i$ plus the minimum spacing. For instance,
if $f_i$ = 100 Hz and the minimum spacing is 60 Hz, $f_{i+1}$ must be at least 160 Hz and can be any
frequency greater than 160 Hz. The minimum spacing is a fixed number that does not vary
frame by frame and is known to both the encoder and decoder so that they can cooperate.
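Enforcing the minimum spacing can be sketched as a single forward pass that pushes each LSF up until it sits at least the minimum spacing above its predecessor. This is only an illustration; `enforce_min_spacing` is a hypothetical name, and practical coders usually also bound the first and last LSF away from 0 Hz and the Nyquist frequency, which this sketch omits.

```python
def enforce_min_spacing(lsfs_hz, min_spacing_hz=60.0):
    """Return LSFs adjusted so that f[i+1] >= f[i] + min_spacing_hz.
    Single forward pass; edge bounds (0 Hz, Nyquist) are not enforced."""
    out = list(lsfs_hz)
    for i in range(1, len(out)):
        out[i] = max(out[i], out[i - 1] + min_spacing_hz)
    return out
```

With the numbers from the text, an LSF at 130 Hz following one at 100 Hz is moved to 160 Hz, and an already well-spaced LSF at 300 Hz is left alone.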

Let us assume that the encoder uses predictive coding to code the LSFs (as opposed
to non-predictive coding), which is necessary to achieve speech communication at low bit
rates. In other words, the encoder uses the quantized LSF of a previous frame or frames to
predict the LSF of the current frame. The error between the predicted LSF and the true LSF
of the current frame, which the encoder derives from the LPC spectrum, is quantized and
transmitted to the decoder. The decoder determines the predicted LSF of the current frame in
the same manner that the encoder did. Then, by knowing the error that was transmitted by
the encoder, the decoder can calculate the true LSF of the current frame. However, what
happens if a frame containing LSF information is lost? Turning to FIG. 9, suppose that the
encoder transmits frames 0-3, but the decoder only receives frames 0, 2 and 3. Frame 1 is the
lost or "erased" frame. If the current frame is lost frame 1, the decoder does not have the
error information that is necessary to calculate the true LSF. As a result, prior art systems
did not calculate the true LSF and instead set the LSF to be the LSF of the previous frame,
or the average LSF of a certain number of previous frames. The problems with this approach
are that the LSF of the current frame may be too inaccurate (compared to the true LSF) and
the subsequent frames (i.e., frames 2 and 3 in the example of FIG. 9) use an inaccurate LSF of
frame 1 to determine their own LSFs. Consequently, the LSF extrapolation error introduced
by a lost frame taints the accuracy of the LSFs of the subsequent frames.
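To make the error propagation concrete, here is a toy example with one LSF track. The specification does not give the predictor, so a first-order, mean-removed predictor with coefficient RHO = 0.6 is assumed, and quantization is skipped so that only the effect of the erasure is visible. All names are hypothetical.

```python
RHO = 0.6  # hypothetical prediction coefficient (not given in the text)


def predict(prev_lsf, mean_lsf):
    """First-order mean-removed prediction of the current LSF vector."""
    return [m + RHO * (p - m) for p, m in zip(prev_lsf, mean_lsf)]


def encode(lsf_frames, mean_lsf):
    """Prediction residual for each frame (quantization omitted)."""
    prev = list(mean_lsf)
    residuals = []
    for lsf in lsf_frames:
        pred = predict(prev, mean_lsf)
        residuals.append([t - q for t, q in zip(lsf, pred)])
        prev = lsf  # unquantized, so the encoder's reference is the true LSF
    return residuals


def decode(residuals, mean_lsf, lost):
    """Rebuild LSFs; a lost frame repeats the previous LSF (prior-art approach)."""
    prev = list(mean_lsf)
    out = []
    for idx, res in enumerate(residuals):
        if idx in lost:
            lsf = list(prev)
        else:
            pred = predict(prev, mean_lsf)
            lsf = [p + r for p, r in zip(pred, res)]
        out.append(lsf)
        prev = lsf
    return out
```

With a mean of 500 Hz and true LSFs 500, 800, 600 and 600 Hz, losing frame 1 makes the decoder output 420 Hz and then 492 Hz for frames 2 and 3. Both frames were received intact, yet they are off by 180 Hz and 108 Hz, and the error shrinks only by a factor of RHO per frame.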

In an example embodiment of the present invention, an improved speech decoder
includes a counter that counts the number of good frames that follow the lost frame. FIG. 10
illustrates an example of the minimum LSF spacings associated with each frame. Suppose
that good frame 0 is received by the decoder, but frame 1 is lost. Under the prior art
approach, the minimum spacing between LSFs was a fixed number (60 Hz in FIG. 10) that
does not change. By contrast, when the improved speech decoder notices a lost frame, it
increases the minimum spacing of that frame so as to avoid creating a quasi-unstable filter.
The amount of increase in this "controlled adaptive LSF spacing" depends on what increase
in spacing would be best for that particular case. For example, the improved speech decoder
may consider how the energy of the signal (or the power of the signal) evolved over time,
how the frequency content (spectrum) of the signal evolved over time, and the counter to
determine at what value the minimum spacing of the lost frame should be set. A person of
ordinary skill in the art could run simple experiments to determine what minimum spacing
value would be satisfactory to use. One advantage of analyzing the speech signal and/or its
parameters to derive an appropriate LSF is that the resultant LSF may be closer to the true
(but lost) LSF of that frame.
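The text leaves the spacing values to experiment and to energy and spectrum cues, so any concrete schedule is an assumption. The sketch below shows one hypothetical use of the good-frame counter: the erased frame gets a boosted minimum spacing, and the boost is then relaxed linearly over a few good frames. `LOST_BOOST_HZ`, `RECOVERY_FRAMES` and `AdaptiveSpacing` are illustrative names and values, and the energy and spectrum analysis is not modeled.

```python
BASE_SPACING_HZ = 60.0  # fixed spacing from the FIG. 10 example
LOST_BOOST_HZ = 40.0    # hypothetical extra spacing for an erased frame
RECOVERY_FRAMES = 3     # hypothetical number of good frames to relax over


class AdaptiveSpacing:
    """Counts good frames since the last loss and returns a minimum LSF
    spacing for each frame. Illustrative schedule only."""

    def __init__(self):
        self.good_since_loss = None  # None means no loss seen yet

    def spacing(self, frame_lost):
        if frame_lost:
            self.good_since_loss = 0
            return BASE_SPACING_HZ + LOST_BOOST_HZ
        if self.good_since_loss is None:
            return BASE_SPACING_HZ
        self.good_since_loss += 1
        remaining = max(0, RECOVERY_FRAMES - self.good_since_loss)
        return BASE_SPACING_HZ + LOST_BOOST_HZ * remaining / RECOVERY_FRAMES
```

For the frame sequence good, lost, good, good, good, the spacings come out as 60, 100, about 86.7, about 73.3 and 60 Hz.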

Adaptive Codebook Excitation ("Pitch Lag")

The total excitation $e_T$, composed of the adaptive codebook excitation and the fixed
codebook excitation, is described by the following equation:

$$e_T = g_p \cdot e_{xp} + g_c \cdot e_{xc}, \qquad (27)$$

where $g_p$ and $g_c$ are the quantized adaptive codebook gain and fixed codebook gain,
respectively, and $e_{xp}$ and $e_{xc}$ are the adaptive codebook excitation and fixed codebook
excitation. A buffer (also called the adaptive codebook buffer) holds $e_T$ and its components
from the previous frame. Based on the pitch lag parameter in the current frame, the speech
communication system selects an $e_T$ from the buffer and uses it as $e_{xp}$ for the current frame.
The values for $g_p$, $g_c$ and $e_{xc}$ are obtained from the current frame. The $e_{xp}$, $g_p$, $g_c$ and $e_{xc}$ are
then plugged into the formula to calculate an $e_T$ for the current frame. The calculated $e_T$ and
its components are stored for the current frame in the buffer. The process repeats, whereby
the buffered $e_T$ is then used as $e_{xp}$ for the next frame. Thus, the feedback nature of this
encoding approach (which is replicated by the decoder) is apparent. Because the information
in the equation is quantized, the encoder and decoder are synchronized. Note that the buffer
is a type of adaptive codebook (but is different from the adaptive codebook used for gain
excitations).
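The feedback loop around Equation (27) can be sketched as follows. Assumptions: integer pitch lags only, a growing Python list standing in for the fixed-length adaptive codebook buffer, and, when the lag is shorter than the (sub)frame, periodic repetition of the last `pitch_lag` samples (a common CELP convention, not stated in the text). Function names are hypothetical.

```python
def adaptive_codebook_vector(past_excitation, pitch_lag, length):
    """Select e_xp from the excitation history using an integer pitch lag.
    If the lag is shorter than `length`, the last `pitch_lag` samples repeat."""
    assert 0 < pitch_lag <= len(past_excitation)
    start = len(past_excitation) - pitch_lag
    return [past_excitation[start + (n % pitch_lag)] for n in range(length)]


def total_excitation(g_p, e_xp, g_c, e_xc):
    """Equation (27): e_T = g_p * e_xp + g_c * e_xc."""
    return [g_p * p + g_c * c for p, c in zip(e_xp, e_xc)]


def decode_subframe(buffer, pitch_lag, g_p, g_c, e_xc):
    """Build e_T for one (sub)frame and append it to the buffer so that later
    (sub)frames can draw their adaptive codebook excitation from it."""
    e_xp = adaptive_codebook_vector(buffer, pitch_lag, len(e_xc))
    e_t = total_excitation(g_p, e_xp, g_c, e_xc)
    buffer.extend(e_t)
    return e_t
```

Because each call appends its own e_T to the buffer, a wrong `pitch_lag` in one call changes every later adaptive codebook vector. This is the loss of synchronization discussed in the following paragraphs.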

FIG. 11 illustrates an example of the pitch lag information transmitted by the prior art
speech system for four frames 1-4. The prior art encoder would transmit the pitch lag for the
current frame and a delta value, where the delta value is the difference between the pitch lag
of the current frame and the pitch lag of the previous frame. The EVRC (Enhanced Variable
Rate Coder) standard specifies the use of the delta pitch lag. Thus, for example, the packet of
information concerning frame 1 would include pitch lag L1 and delta (L1 - L0), where L0 is
the pitch lag of preceding frame 0; the packet of information concerning frame 2 would
include pitch lag L2 and delta (L2 - L1); the packet of information concerning frame 3
would include pitch lag L3 and delta (L3 - L2); and so on. Note that the pitch lags of
adjacent frames could be equal, so delta values could be zero. If frame 2 were lost and never
received by the decoder, the only information about the pitch lag available at the time of
frame 2 would be pitch lag L1, because the previous frame 1 was not lost. The loss of the
pitch lag L2 and delta (L2 - L1) information creates two problems. The first problem is how to
estimate an accurate pitch lag L2 for lost frame 2. The second problem is how to prevent the
error in estimating the pitch lag L2 from creating errors in subsequent frames. Some prior art
systems do not attempt to fix either problem.

In trying to resolve the first problem, some prior art systems use the pitch lag L1 from
the previous good frame 1 as an estimated pitch lag L2' for the lost frame 2, even though any
difference between the estimated pitch lag L2' and the true pitch lag L2 would be an error.
The second problem is how to prevent the error in estimated pitch lag L2' from
creating errors in subsequent frames. Recall that, as previously discussed, the pitch lag of
frame n is used to update the adaptive codebook buffer, which in turn is used by subsequent
frames. The error between estimated pitch lag L2' and the true pitch lag L2 would create an
error in the adaptive codebook buffer, which would then create an error in the subsequently
received frames. In other words, the error in the estimated pitch lag L2' may result in the loss
of synchronicity between the adaptive codebook buffer from the encoder's point of view and
the adaptive codebook buffer from the decoder's point of view. As a further example, during
processing of current lost frame 2, the prior art decoder would set the estimated pitch lag L2'
equal to pitch lag L1 (which probably differs from true pitch lag L2) to retrieve $e_{xp}$ for frame 2.
The use of an erroneous pitch lag therefore selects the wrong $e_{xp}$ for frame 2, and this
error propagates through the subsequent frames. To resolve this problem in the prior art,
when frame 3 is received by the decoder, the decoder now has pitch lag L3 and delta (L3 -
L2) and can thus reverse-calculate what true pitch lag L2 should have been. The true pitch
lag L2 is simply pitch lag L3 minus the delta (L3 - L2). Thus, the prior art decoder could
correct the adaptive codebook buffer that is used by frame 3. Because the lost frame 2 has
already been processed with the estimated pitch lag L2', it is too late to fix lost frame 2.
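The prior-art behavior just described can be traced with a short sketch. Each packet carries (lag, delta) with delta = current lag minus previous lag. A lost frame is decoded with the previous lag, and when the next packet arrives the lost lag is reverse-calculated. The correction helps only the buffer used by the next frame, because the lost frame has already been synthesized. The sketch assumes the first frame is never lost, and `prior_art_decode` is a hypothetical name.

```python
def prior_art_decode(packets):
    """packets[i] is (lag, delta) or None if frame i was lost, with
    delta = lag_i - lag_{i-1}. Assumes packets[0] is not None.
    Returns (lag used when each frame was decoded,
             lag known for each frame after later corrections)."""
    used, corrected = [], []
    for i, pkt in enumerate(packets):
        if pkt is None:
            used.append(used[-1])       # estimate L2' = L1
            corrected.append(used[-1])
        else:
            lag, delta = pkt
            used.append(lag)
            corrected.append(lag)
            if i > 0 and packets[i - 1] is None:
                # L2 = L3 - (L3 - L2); only the buffer for frame i benefits.
                corrected[i - 1] = lag - delta
    return used, corrected
```

With lags 40, 42, (lost), 46 and a delta of 2 in the last packet, frame 2 is decoded with lag 42, and its true lag of 44 is recovered only after frame 3 arrives.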

FIG. 12 illustrates a hypothetical case of frames to demonstrate the operation of an
example embodiment of an improved speech communication system that addresses both problems
caused by lost pitch lag information. Suppose that frame 2 is lost and frames 0, 1, 3 and 4 are
received. During the time that the decoder is processing lost frame 2, the improved decoder
may use the pitch lag L1 from the previous frame 1. Alternatively and preferably, the
improved decoder may perform an extrapolation based on the pitch lag(s) of the previous
frame(s) to determine an estimated pitch lag L2', which may result in a more accurate
estimation than pitch lag L1. Thus, for example, the decoder may use pitch lags L0 and L1
to extrapolate the estimated pitch lag L2'. The extrapolation method may be any
extrapolation method, such as a curve fitting method that assumes a smooth pitch contour
from the past to estimate the lost pitch lag L2, one that uses an average of past pitch lags, or
any other extrapolation method. This approach reduces the number of bits that are transmitted
from the encoder to the decoder because the delta value need not be transmitted.
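One simple instance of the "smooth contour" extrapolation is to continue the straight line through the last two received lags, L2' = 2*L1 - L0, falling back to repeating L1 when only one past lag is known. Linear extrapolation is our choice for illustration; the text allows any curve fit or averaging. A real decoder would also clamp the result to the coder's valid lag range, which is not given here. `extrapolate_lag` is a hypothetical name.

```python
def extrapolate_lag(past_lags):
    """Estimate the lag of a lost frame from previously received lags.
    Two or more lags: linear extrapolation L2' = 2*L1 - L0.
    One lag: repeat it (the prior-art estimate L2' = L1)."""
    if len(past_lags) >= 2:
        return 2 * past_lags[-1] - past_lags[-2]
    return past_lags[-1]
```

If the lags have been rising, for example L0 = 40 and L1 = 42, the estimate is 44 rather than the 42 that simple repetition would give.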
To solve the second problem, when the improved decoder receives frame 3, the
decoder has the correct pitch lag L3. However, as explained above, the adaptive codebook
buffer used by frame 3 may be incorrect due to any extrapolation error in estimating pitch lag
L2'. The improved decoder seeks to prevent errors in estimating pitch lag L2' in frame 2 from
affecting frames after frame 2, but without having to transmit delta pitch lag information.
Once the improved decoder obtains pitch lag L3, it uses an interpolation method such as a
curve fitting method to adjust or fine-tune its prior estimation of pitch lag L2'. By knowing
pitch lags L1 and L3, the curve fitting method can estimate L2' more accurately than when
pitch lag L3 was unknown. The result is a fine-tuned pitch lag L2'' which is used to adjust or
correct the adaptive codebook buffer for use by frame 3. More particularly, the fine-tuned
pitch lag L2'' is used to adjust or correct the quantized adaptive codebook excitation in the
adaptive codebook buffer. Consequently, the improved decoder reduces the number of bits
that must be transmitted while fine-tuning pitch lag L2' in a manner which is satisfactory for
most cases. Thus, in order to reduce the effect of any error in the estimation of pitch lag L2
on the subsequently received frames, the improved decoder may use the pitch lag L3 of the
next frame 3 and the pitch lag L1 of the previously received frame 1 to fine-tune the previous
estimation of the pitch lag L2 by assuming a smooth pitch contour. The accuracy of this
estimation approach based on the pitch lags of the received frames preceding and succeeding
the lost frame may be very good because pitch contours are generally smooth for voiced
speech.
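
Once L3 arrives, the fine-tuning of L2' into L2'' can be illustrated with the simplest smooth-contour assumption, a straight line between L1 and L3. The sketch below is hypothetical (the function name and the linear interpolation are assumptions; the text allows any curve fitting method), and the subsequent correction of the adaptive codebook buffer is not shown because the text gives no details of it.

```python
def fine_tune_lost_lag(lag_before, lag_after, n_lost=1, position=1):
    """Re-estimate the lag of a lost frame once the next frame's lag is known.

    Hypothetical helper using straight-line interpolation between the last
    received lag (e.g. L1) and the first lag received after the loss (e.g. L3).
    position is 1 for the first lost frame, 2 for the second, and so on.
    """
    if not 1 <= position <= n_lost:
        raise ValueError("position must lie within the run of lost frames")
    fraction = position / (n_lost + 1)
    return lag_before + (lag_after - lag_before) * fraction


l2_prime = 54.0                               # earlier extrapolated estimate L2'
l2_double_prime = fine_tune_lost_lag(52, 58)  # L2'' from L1 = 52 and L3 = 58
print(l2_prime, l2_double_prime)              # 54.0 55.0
```

Here the extrapolated guess of 54 is pulled to 55, the midpoint of the received neighbours; a higher-order fit that also uses L0 would be another option under the same smoothness assumption.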

Gains 

During the transmission of frames from the encoder to the decoder, a lost frame also
results in lost gain parameters such as the adaptive codebook gain g_p and the fixed codebook
gain g_c. Each frame contains a plurality of subframes where each subframe has gain
information. Thus, the loss of a frame results in lost gain information for each subframe of
the frame. Speech communication systems have to estimate gain information for each
subframe of the lost frame. The gain information for one subframe may differ from that of
another subframe.

WO 02/07061 PCT/IB01/01228

Prior art systems took various approaches to estimate the gains for subframes of the
lost frame. One approach used the gain from the last subframe of the previous good frame as
the gain of each subframe of the lost frame. Another variation was to use the gain from the
last subframe of the previous good frame as the gain of the first subframe of the lost frame and
to attenuate this gain gradually before using it as the gains of the next subframes of the lost
frame. In other words, for example, if each frame has four subframes and frame 1 is received
but frame 2 is lost, the gain parameters in the last subframe of received frame 1 are used as
the gain parameters of the first subframe of lost frame 2; the gain parameters are then
decreased by some amount and used as the gain parameters of the second subframe of lost
frame 2, decreased again and used as the gain parameters of the third subframe of lost
frame 2, and decreased still further and used as the gain parameters of the last subframe of
lost frame 2. Still another approach was to examine the gain parameters of the subframes of a
fixed number of previously received frames to calculate average gain parameters, which were
then used as the gain parameters of the first subframe of lost frame 2; the gain parameters
could then be decreased gradually and used as the gain parameters of the remaining subframes
of the lost frame. Yet another approach was to derive median gain parameters by examining
the subframes of a fixed number of previously received frames and to use the median values
as the gain parameters of the first subframe of lost frame 2; again, the gain parameters could
be decreased gradually and used as the gain parameters of the remaining subframes of the lost
frame. Notably, the prior art approaches did not apply different recovery methods to the
adaptive codebook gains and the fixed codebook gains; they used the same recovery method
on both types of gain.
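
The prior-art strategies just listed can be summarized in one routine. This is an illustrative sketch, not code from any prior-art system: the name `prior_art_lost_gains` and the multiplicative `decay` factor (standing in for "decreased by some amount") are assumptions. Note that the same routine would be applied to g_p and g_c alike, which is exactly the limitation the text points out.

```python
from statistics import mean, median


def prior_art_lost_gains(prev_gains, n_subframes=4, mode="attenuate", decay=0.9):
    """Gains for the subframes of a lost frame, prior-art style.

    prev_gains: gains of previously received subframes, most recent last.
    mode: 'repeat', 'attenuate', 'average' or 'median'.
    decay: assumed per-subframe attenuation factor (not given in the text).
    """
    if not prev_gains:
        raise ValueError("need at least one previously received gain")
    if mode == "repeat":
        # Last good subframe's gain reused unchanged for every subframe.
        return [prev_gains[-1]] * n_subframes
    starts = {
        "attenuate": prev_gains[-1],
        "average": mean(prev_gains),
        "median": median(prev_gains),
    }
    if mode not in starts:
        raise ValueError(f"unknown mode: {mode!r}")
    # First lost subframe gets the starting value; later ones decay gradually.
    return [starts[mode] * decay ** k for k in range(n_subframes)]


print(prior_art_lost_gains([0.5, 0.6, 0.8], mode="median", decay=1.0))
```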

The improved speech communication system may also handle lost gain parameters
due to a lost frame. If the speech communication system differentiates between periodic-like
speech and non-periodic-like speech, the system may handle lost gain parameters differently
for each type of speech. Moreover, the improved system handles lost adaptive codebook
gains differently than it handles lost fixed codebook gains. Let us first examine the case of
non-periodic-like speech. To determine an estimated adaptive codebook gain g_p, the
improved decoder computes an average g_p of the subframes of an adaptive number of
previously received frames. The pitch lag of the current frame (i.e., the lost frame), which
was estimated by the decoder, is used to determine the number of previously received frames
to examine. Generally, the larger the pitch lag, the greater the number of previously received
frames used to calculate the average g_p. Therefore, the improved decoder uses a
pitch-synchronized averaging approach to estimate the adaptive codebook gain g_p for
non-periodic-like speech. The improved decoder then calculates a value β which indicates how
good the prediction of g_p was, based on the following formula:

β = adaptive codebook excitation energy / total excitation energy, that is,

$$\beta = \frac{\| g_p \, e_{xp} \|^2}{\| g_p \, e_{xp} \|^2 + \| g_c \, e_{xc} \|^2} \qquad (28)$$

β varies from 0 to 1 and represents the proportion of the total excitation energy contributed
by the adaptive codebook excitation. The greater β is, the greater the effect of the adaptive
codebook excitation energy. Although not required, the improved decoder preferably treats
nonperiodic-like speech and periodic-like speech differently.
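
Formula (28) translates directly into code. In this sketch the function names are hypothetical, excitation vectors are plain Python lists, and returning 0.0 for an all-zero excitation is an assumption the text does not address. In the flowcharts that follow, the "maximum β" would be the largest such value over a number of previously received subframes or frames.

```python
def excitation_energy(vector):
    """Squared L2 norm of an excitation vector."""
    return sum(x * x for x in vector)


def beta(g_p, e_xp, g_c, e_xc):
    """Formula (28): share of the total excitation energy from the adaptive codebook."""
    adaptive = excitation_energy([g_p * x for x in e_xp])
    fixed = excitation_energy([g_c * x for x in e_xc])
    total = adaptive + fixed
    if total == 0:
        return 0.0  # assumption: a silent subframe has no adaptive contribution
    return adaptive / total


print(beta(1.0, [1, 1], 1.0, [1, 1]))   # 0.5: equal adaptive and fixed energy
print(beta(0.8, [1, -1], 0.0, [1, 1]))  # 1.0: purely adaptive (strongly voiced)
```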

FIG. 16 illustrates an example flowchart of the decoder's processing for nonperiodic-like
speech. Step 1000 determines whether the current frame is the first frame lost after
receiving a frame (i.e., a "good" frame). If the current frame is the first lost frame after a
good frame, step 1002 determines whether the current subframe being processed by the
decoder is the first subframe of a frame. If the current subframe is the first subframe, step
1004 computes an average g_p over a certain number of previous subframes, where the
number of subframes depends on the pitch lag of the current subframe. In an example
embodiment, if the pitch lag is less than or equal to 40, the average g_p is based on two
previous subframes; if the pitch lag is greater than 40 but less than or equal to 80, the average
g_p is based on four previous subframes; if the pitch lag is greater than 80 but less than or
equal to 120, the average g_p is based on six previous subframes; and if the pitch lag is greater
than 120, the average g_p is based on eight previous subframes. Of course, these values are
arbitrary and may be set to any other values depending on the length of the subframe. Step
1006 determines whether the maximum β exceeds a certain threshold. If the maximum β
exceeds a certain threshold, step 1008 sets the fixed codebook gain g_c for all subframes of the
lost frame to zero and sets g_p for all subframes of the lost frame to an arbitrarily high number
such as 0.95 instead of the average g_p determined above. The arbitrarily high number
indicates a good voicing signal. The arbitrarily high number to which g_p of the current
subframe of the lost frame is set may be based on a number of factors including, but not
limited to, the maximum β of a certain number of previous frames, the spectral tilt of the
previously received frame and the energy of the previously received frame.
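
The pitch-synchronized averaging of step 1004 can be sketched with the example thresholds from the text (40, 80 and 120 samples mapping to 2, 4, 6 or 8 subframes). The function names are hypothetical, and using whatever history is available when fewer subframes exist than the window calls for is an assumption.

```python
def subframes_to_average(pitch_lag):
    """Step 1004 window size in the example embodiment (thresholds 40/80/120)."""
    if pitch_lag <= 40:
        return 2
    if pitch_lag <= 80:
        return 4
    if pitch_lag <= 120:
        return 6
    return 8


def pitch_synchronized_average_gp(past_gp, pitch_lag):
    """Average g_p over the most recent received subframes (most recent last)."""
    # Assumption: a shorter history than the window simply uses all of it.
    window = past_gp[-subframes_to_average(pitch_lag):]
    if not window:
        raise ValueError("no previously received g_p values")
    return sum(window) / len(window)


history = [0.2, 0.4, 0.6, 0.8]
print(pitch_synchronized_average_gp(history, 30))   # averages the last 2 subframes
print(pitch_synchronized_average_gp(history, 100))  # window of 6, only 4 available
```

Longer pitch periods span more subframes, so a longer window keeps the average aligned with at least one full pitch cycle; that is the intuition behind tying the window to the lag.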

Otherwise, if the maximum β does not exceed a certain threshold (i.e., a previously
received frame contains the onset of speech), step 1010 sets the g_p of the current subframe of
the lost frame to be the minimum of (i) the average g_p determined above and (ii) the
arbitrarily selected high number (e.g., 0.95). Another alternative is to set the g_p of the current
subframe of the lost frame based on the spectral tilt of the previously received frame, the
energy of the previously received frame, and the minimum of the average g_p determined
above and the arbitrarily selected high number (e.g., 0.95). In the case where the maximum
β does not exceed a certain threshold, the fixed codebook gain g_c is based on the energy of
the gain-scaled fixed codebook excitation in the previous subframe and the energy of the
fixed codebook excitation in the current subframe. Specifically, the energy of the gain-scaled
fixed codebook excitation in the previous subframe is divided by the energy of the fixed
codebook excitation in the current subframe, and the square root of the result is multiplied by
an attenuation factor and used as g_c, as shown in the following formula:

$$g_c = \text{attenuation factor} \times \sqrt{\frac{\| g_c^{\text{prev}} \, e_{xc}^{\text{prev}} \|^2}{\| e_{xc} \|^2}} \qquad (29)$$

where the superscript "prev" denotes the previous subframe and e_xc is the fixed codebook
excitation of the current subframe.

Alternatively, the decoder may base the g_c for the current subframe of the lost frame on
the ratio of the energy of the previously received frame to the energy of the current lost
frame.
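
Formula (29) in code form, as a sketch: the function name is hypothetical, the default attenuation factor of 0.9 is an assumed value (the text gives none), and returning zero for an all-zero current excitation is an added guard.

```python
import math


def fixed_gain_formula_29(prev_g_c, prev_e_xc, cur_e_xc, attenuation=0.9):
    """Estimate g_c for a lost subframe per formula (29).

    prev_g_c, prev_e_xc: gain and fixed codebook excitation of the previous
    subframe; cur_e_xc: fixed codebook excitation of the current subframe.
    attenuation=0.9 is an assumed value.
    """
    prev_energy = sum((prev_g_c * x) ** 2 for x in prev_e_xc)
    cur_energy = sum(x * x for x in cur_e_xc)
    if cur_energy == 0:
        return 0.0  # guard: no fixed excitation to scale
    return attenuation * math.sqrt(prev_energy / cur_energy)


print(fixed_gain_formula_29(2.0, [1, 1, 1, 1], [1, 1, 1, 1]))  # 1.8
```

The square root converts the energy ratio back to an amplitude gain, so the current subframe's fixed excitation carries roughly the previous subframe's energy, slightly attenuated.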

Returning to step 1002, if the current subframe is not the first subframe, step 1020 sets
the g_p of the current subframe of the lost frame to a value that is attenuated or reduced from
the g_p of the previous subframe. Each g_p of the remaining subframes is set to a value
further attenuated from the g_p of the previous subframe. The g_c of the current subframe is
calculated in the same manner as in step 1010 and formula (29).

Returning to step 1000, if this is not the first lost frame after a good frame, step 1022
calculates the g_c of the current subframe in the same manner as in step 1010 and
formula (29). Step 1022 also sets the g_p of the current subframe of the lost frame to a value
that is attenuated or reduced from the g_p of the previous subframe. Because the decoder
estimates g_p and g_c differently, the decoder may estimate them more accurately than the
prior art systems.
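
The FIG. 16 flow for one lost frame of nonperiodic-like speech can be condensed into a single function. This is a hedged sketch, not the decoder's actual code: all names are hypothetical, the β threshold (0.7), the per-subframe g_p decay (0.9) and the formula (29) attenuation factor (0.9) are assumed values, and the alternatives mentioned in the text (spectral tilt, frame-energy ratio) are omitted.

```python
import math

HIGH_GP = 0.95         # "arbitrarily high number" from the text's example
BETA_THRESHOLD = 0.7   # assumption: the text leaves the threshold open
GP_DECAY = 0.9         # assumption: per-subframe attenuation of g_p
GC_ATTENUATION = 0.9   # assumption: attenuation factor in formula (29)


def _average_gp(past_gp, pitch_lag):
    # Step 1004: window of 2/4/6/8 subframes depending on the pitch lag.
    n = 2 if pitch_lag <= 40 else 4 if pitch_lag <= 80 else 6 if pitch_lag <= 120 else 8
    window = past_gp[-n:]
    return sum(window) / len(window)


def _formula_29(prev_gc, prev_exc, cur_exc):
    prev_energy = sum((prev_gc * x) ** 2 for x in prev_exc)
    cur_energy = sum(x * x for x in cur_exc)
    return 0.0 if cur_energy == 0 else GC_ATTENUATION * math.sqrt(prev_energy / cur_energy)


def recover_nonperiodic_gains(first_lost_frame, pitch_lag, past_gp, max_beta,
                              prev_gp, prev_gc, prev_exc, lost_excs):
    """Return per-subframe lists (g_p, g_c) for one lost frame, following FIG. 16.

    lost_excs holds the fixed codebook excitation of each subframe of the lost
    frame; prev_gp, prev_gc and prev_exc describe the last subframe before it.
    """
    n = len(lost_excs)
    if first_lost_frame and max_beta > BETA_THRESHOLD:
        return [HIGH_GP] * n, [0.0] * n                          # steps 1006/1008
    gps, gcs = [], []
    gc, exc = prev_gc, prev_exc
    for k, cur in enumerate(lost_excs):
        if first_lost_frame and k == 0:
            gp = min(_average_gp(past_gp, pitch_lag), HIGH_GP)  # steps 1004/1010
        else:
            gp = (gps[-1] if gps else prev_gp) * GP_DECAY       # steps 1020/1022
        gc = _formula_29(gc, exc, cur)                           # formula (29)
        gps.append(gp)
        gcs.append(gc)
        exc = cur
    return gps, gcs


print(recover_nonperiodic_gains(True, 30, [0.2, 0.4, 0.6, 0.8], 0.2,
                                0.5, 1.0, [1, 1], [[1, 1], [1, 1]]))
```

Keeping the g_p and g_c paths separate inside one loop mirrors the point of the flowchart: the adaptive gain follows the pitch-synchronized average and then decays, while the fixed gain tracks excitation energy through formula (29).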

Now let us examine the case of periodic-like speech in accordance with the example
flowchart illustrated in FIG. 17. Because the decoder may apply different approaches to
estimating g_p and g_c for periodic-like speech and non-periodic-like speech, the estimation of
the gain parameters may be more accurate than the prior art approaches. Step 1030
determines whether the current frame is the first frame lost after receiving a frame (i.e., a
"good" frame). If the current frame is the first lost frame after a good frame, step 1032 sets
g_c to zero for all subframes of the current frame and sets g_p to an arbitrarily high number
such as 0.95 for all subframes of the current frame. If the current frame is not the first lost
frame after a good frame (e.g., it is the second lost frame, third lost frame, etc.), step 1034
sets g_c to zero for all subframes of the current frame and sets g_p to a value that is attenuated
from the g_p of the previous subframe.

FIG. 13 illustrates a case of frames to demonstrate the operation of the improved
speech decoder. Suppose that frames 1, 3 and 4 are good (i.e., received) frames while frames
2 and 5-8 are lost frames. If the current lost frame is the first lost frame after a good frame, the
decoder sets g_p to an arbitrarily high number (such as 0.95) for all subframes of the lost
frame. Turning to FIG. 13, this would apply to lost frames 2 and 5. The g_p of the first lost
frame 5 is attenuated gradually to set the g_p values of the other lost frames 6-8. Hence, for
example, if g_p is set to 0.95 for lost frame 5, g_p could be set to 0.9 for lost frame 6, 0.85
for lost frame 7 and 0.8 for lost frame 8. For g_c, the decoder computes the average g_p from
the previously received frames, and if this average g_p exceeds a certain threshold, g_c is set to
zero for all subframes of the lost frame. If the average g_p does not exceed a certain threshold,
the decoder uses the same approach of setting g_c for non-periodic-like signals described
above to set g_c here.
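
The periodic-like rules can be checked against the FIG. 13 numbers with a short sketch. Function names are hypothetical; the fixed step of 0.05 per lost frame simply reproduces the 0.95, 0.9, 0.85, 0.8 example, and the average-g_p threshold of 0.6 is an assumed value. The FIG. 17 description sets g_c to zero outright, whereas the FIG. 13 walk-through adds the threshold test; the sketch follows the latter.

```python
HIGH_GP = 0.95
GP_STEP = 0.05            # reproduces the FIG. 13 example: 0.95, 0.9, 0.85, 0.8
AVG_GP_THRESHOLD = 0.6    # assumption: the text does not give the threshold


def periodic_lost_gp(frames_since_good):
    """g_p for a lost frame; 0 means the first lost frame after a good frame."""
    return max(round(HIGH_GP - GP_STEP * frames_since_good, 6), 0.0)


def periodic_lost_gc(past_gp, nonperiodic_gc):
    """Zero g_c if the received frames were strongly voiced on average;
    otherwise reuse the non-periodic estimate (e.g. from formula (29))."""
    average = sum(past_gp) / len(past_gp)
    return 0.0 if average > AVG_GP_THRESHOLD else nonperiodic_gc


def gp_schedule(received):
    """Map each lost frame number to its g_p, given {frame: was_received}."""
    schedule, run = {}, 0
    for frame in sorted(received):
        if received[frame]:
            run = 0
        else:
            schedule[frame] = periodic_lost_gp(run)
            run += 1
    return schedule


# FIG. 13: frames 1, 3 and 4 received; frames 2 and 5-8 lost.
status = {1: True, 2: False, 3: True, 4: True, 5: False, 6: False, 7: False, 8: False}
print(gp_schedule(status))  # {2: 0.95, 5: 0.95, 6: 0.9, 7: 0.85, 8: 0.8}
```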

After the decoder estimates the lost parameters (e.g., LSFs, pitch lags, gains,
classification, etc.) in a lost frame and synthesizes the resultant speech, the decoder can
match the energy of the synthesized speech of the lost frame with the energy of the previously
received frame through extrapolation techniques. This may further improve the accuracy of
reproduction of the original speech despite lost frames.
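
As a minimal illustration of energy matching, the sketch below rescales the lost frame's synthesized speech so its energy equals that of the previously received frame. This is a plain gain match used in place of the unspecified extrapolation technique; the function name is hypothetical.

```python
import math


def match_energy(lost_speech, previous_speech):
    """Scale the lost frame's synthesized speech to the previous frame's energy.

    A plain gain match, used here in place of the unspecified extrapolation.
    """
    lost_energy = sum(x * x for x in lost_speech)
    prev_energy = sum(x * x for x in previous_speech)
    if lost_energy == 0:
        return list(lost_speech)  # nothing to scale
    scale = math.sqrt(prev_energy / lost_energy)
    return [scale * x for x in lost_speech]


print(match_energy([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0]))
# [2.0, -2.0, 2.0, -2.0]
```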


Seed for Generating Fixed Codebook Excitations 

In order to save bandwidth, a speech encoder need not transmit a fixed codebook
excitation to the decoder during periods of background noise or silence. Instead, both the
encoder and decoder can randomly generate an excitation value locally by using a Gaussian
time series generator. Both the encoder and decoder are configured to generate the same
random excitation values in the same order. As a result, because the decoder can locally
generate the same random excitation value that the encoder generated for a given noise
frame, the excitation value need not be transmitted from the encoder to the decoder. To
generate a random excitation value, the Gaussian time series generator uses an initial seed to
generate the first random excitation value and then updates the seed to a new value. The
generator then uses the updated seed to generate the next random excitation value and
updates the seed to yet another value. FIG. 14 illustrates a hypothetical case of frames to
illustrate how a Gaussian time series generator in a speech encoder uses a seed to generate a
random excitation value and then updates that seed to generate the next random excitation
value. Suppose that frames 0 and 4 contain a speech signal while frames 2, 3 and 5 contain
silence or background noise. Upon finding the first noise frame (i.e., frame 2), the encoder
uses the initial seed (referred to as "seed 1") to generate a random excitation value to use as
the fixed codebook excitation for that frame. For each sample of that frame, the seed is
changed to generate a new fixed codebook excitation. Thus, if a frame were sampled 160
times, the seed would change 160 times. Consequently, by the time the next noise frame is
encountered (noise frame 3), the encoder uses a second and different seed (i.e., seed 2) to
generate the random excitation value for that frame. Although, technically, the seed for the
first sample of the second frame is not the "second" seed, because the seed has changed for
every sample of the first frame, the seed for the first sample of the second frame is referred to
herein as seed 2 for the sake of convenience. For noise frame 4, the encoder uses a third seed
(different from the first and second seeds). To generate the random excitation value for noise
frame 6, the Gaussian time series generator could either start over with seed 1 or proceed
with seed 4, depending on the implementation of the speech communication system. By
configuring the encoder and decoder to update the seed in the same manner, the encoder and
decoder can generate the same seed and thus the same random excitation values in the same
order. However, a lost frame destroys this synchronicity between the encoder and decoder in
prior art speech communication systems.
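
The loss of synchronicity can be reproduced with a toy generator. Everything here is an illustrative assumption: the seed update is a hypothetical 31-bit linear congruential step (the text does not specify one), each uniform value is mapped to a Gaussian sample with the standard library's `statistics.NormalDist.inv_cdf`, and the restart-versus-continue choice for later noise frames is not modelled. It keeps the two properties the text relies on: one seed update per sample, and a seed that carries over from one noise frame to the next.

```python
from statistics import NormalDist

FRAME_LEN = 160            # samples per frame, as in the text's example
_NORMAL = NormalDist()     # standard Gaussian (mean 0, sigma 1)


def next_seed(seed):
    """Hypothetical seed update: a 31-bit linear congruential step."""
    return (1103515245 * seed + 12345) % (1 << 31)


def gaussian_excitation(seed, n=FRAME_LEN):
    """Return n Gaussian samples and the updated seed (one update per sample)."""
    samples = []
    for _ in range(n):
        seed = next_seed(seed)
        u = (seed + 0.5) / (1 << 31)          # strictly inside (0, 1)
        samples.append(_NORMAL.inv_cdf(u))
    return samples, seed


def generate_noise_excitations(noise_frames, seed=1):
    """Excitations for the noise frames one side knows about, in order."""
    excitations = {}
    for frame in noise_frames:
        excitations[frame], seed = gaussian_excitation(seed)
    return excitations


encoder = generate_noise_excitations([2, 3])
decoder = generate_noise_excitations([3])   # noise frame 2 was lost
print(decoder[3] == encoder[3])  # False: decoder used "seed 1" for frame 3
print(decoder[3] == encoder[2])  # True: it reproduced frame 2's noise instead
```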

FIG. 15 illustrates the hypothetical case presented in FIG. 14, but from the decoder's
point of view. Suppose that noise frame 2 is lost and that frames 1 and 3 are received by the
decoder. Because noise frame 2 is lost, the decoder assumes that it was of the same type as
the previous frame 1 (i.e., a speech frame). Having made the wrong assumption about lost
noise frame 2, the decoder presumes that noise frame 3 is the first noise frame when it is
really the second noise frame encountered. Because the seeds are updated for each sample of
every noise frame encountered, the decoder would erroneously use seed 1 to generate the
random excitation value for noise frame 3 when seed 2 should have been used. The lost
frame therefore results in lost synchronicity between the encoder and decoder. Because
frame 2 is a noise frame, it is not significant that the decoder uses seed 1 while the encoder
used seed 2, since the result is merely a different noise than the original noise. The same is
true of frame 3. However, the error in seed values is significant for its impact on subsequently
received frames containing speech. For example, consider speech frame 4. The locally
generated Gaussian excitation based on seed 2 is used to continually update the adaptive
codebook buffer of frame 3. When frame 4 is processed, the adaptive codebook excitation is
extracted from the adaptive codebook buffer of frame 3 based on information such as the
pitch lag in frame 4. Because the encoder used seed 3 to update the adaptive codebook
buffer of frame 3 and the decoder is using seed 2 (the wrong seed!) to update the adaptive
codebook buffer of frame 3, the difference in updating the adaptive codebook buffer of
frame 3 could create a quality problem in frame 4 in some cases.

The improved speech communication system built in accordance with the present
invention does not use an initial fixed seed and then update that seed every time the system
encounters a noise frame. Instead, the improved encoder and decoder derive the seed for a
given frame from parameters in that frame. For example, the spectrum information, energy
and/or gain information in the current frame could be used to generate the seed for that
frame. For instance, one could use the bits representing the spectrum (say, 5 bits b1, b2, b3,
b4, b5) and the bits representing the energy (say, 3 bits c1, c2, c3) to form a string b1, b2, b3,
b4, b5, c1, c2, c3 whose value is the seed. As a numeric example, suppose that the spectrum
is represented by 01101 and the energy is represented by 011; then the seed is 01101011.
Certainly, other alternative methods of deriving a seed from information in the frame are
possible and included within the scope of the invention. Consequently, in the example of
FIG. 15 where noise frame 2 is lost, the decoder will be able to derive a seed for noise frame
3 that is the same seed derived by the encoder. Thus, a lost frame does not destroy the
synchronicity between the encoder and decoder.
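
The bit-concatenation example can be verified with a few lines. Representing the transmitted bits as lists of integers and the name `seed_from_frame_bits` are assumptions for illustration; how the resulting seed then drives the Gaussian generator is not shown.

```python
def seed_from_frame_bits(spectrum_bits, energy_bits):
    """Concatenate transmitted parameter bits (most significant first) into a seed."""
    seed = 0
    for bit in list(spectrum_bits) + list(energy_bits):
        if bit not in (0, 1):
            raise ValueError(f"not a bit: {bit!r}")
        seed = (seed << 1) | bit
    return seed


# Numeric example from the text: spectrum 01101, energy 011 -> seed 01101011.
seed = seed_from_frame_bits([0, 1, 1, 0, 1], [0, 1, 1])
print(seed, format(seed, "08b"))  # 107 01101011
```

Because the seed is a pure function of the frame's own received bits, it carries no state from earlier frames, so losing noise frame 2 cannot change the seed the decoder derives for noise frame 3.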

While embodiments and implementations of the subject invention have been shown 
and described, it should be apparent that many more embodiments and implementations are 
within the scope of the subject invention. Accordingly, the invention is not to be restricted, 
except in light of the claims and their equivalents. 

What is claimed is:

1. A decoder for a speech communication system, the decoder comprising:

a receiver that receives parameters of a speech signal to be decoded, the parameters 
being received on a frame-by-frame basis and including a parameter representing the 
minimum spacing for the line spectral frequencies for each frame; 

a control logic coupled to the receiver for decoding the parameters and for 
resynthesizing the speech signal; 

a lost frame detector that detects whether a frame of parameters was not received by 
the receiver; and 

a frame recovery logic that, when the lost frame detector detects a lost frame, sets the 
minimum spacing parameter for the lost frame to a first value which is greater than the 
minimum spacing parameter for the previously received frame. 

2. The decoder of claim 1 wherein the lost frame detector is part of the control logic.

3. The decoder of claim 1 wherein the frame error logic is part of the control logic. 

4. The decoder of claim 2 wherein the frame error logic is part of the control logic. 

5. The decoder of claim 1 wherein the frame recovery logic sets the minimum spacing 
parameter for the frame received after the lost frame to a second value, the second value 
being greater than the minimum spacing parameter for the frame received immediately 
before the lost frame and less than the minimum spacing parameter for the lost frame. 

6. The decoder of claim 5 wherein the frame recovery logic sets the minimum spacing 
parameter for the second frame received after the lost frame to a third value, the third value
being less than or equal to the minimum spacing parameter for the lost frame. 

7. The decoder of claim 6 wherein the frame recovery logic sets the minimum spacing 
parameter for the second frame received after the lost frame to a third value, the third value 
also being less than or equal to the minimum spacing parameter for the first frame received 
after the lost frame. 

8. The decoder of claim 1 further comprising a counter that counts the number of frames 
received subsequent to the lost frame where the count determines the value of the minimum 
spacing parameter for the received frame. 

9. The decoder of claim 5 further comprising a counter that counts the number of frames 
received subsequent to the lost frame where the count determines the value of the minimum
spacing parameter for the received frame. 

10. The decoder of claim 1 wherein the frame recovery logic sets the minimum spacing 
parameter for the lost frame based at least in part on the energy of the speech signal. 

11. The decoder of claim 1 wherein the frame recovery logic sets the minimum spacing
parameter for the lost frame based at least in part on the frequency spectrum of the speech 
signal. 

12. The decoder of claim 5 wherein the frame recovery logic sets the minimum spacing 
parameter for the lost frame based at least in part on the energy of the speech signal. 

13. The decoder of claim 5 wherein the frame recovery logic sets the minimum spacing
parameter for the lost frame based at least in part on the frequency spectrum of the speech 
signal. 

14. The decoder of claim 12 wherein the frame recovery logic sets the minimum spacing
parameter for the lost frame based also at least in part on the frequency spectrum of the 
speech signal. 

15. The decoder of claim 13 wherein the frame recovery logic sets the minimum spacing 
parameter for the lost frame based also at least in part on the energy of the speech signal.

16. A speech communication system comprising: 

an encoder that processes frames of speech and determines a pitch lag parameter for 
each frame of speech; 

a transmitter coupled to the encoder that transmits the pitch lag parameter for each
frame of speech;

a receiver that receives the pitch lag parameters from the transmitter on a frame-by-frame
basis;

a control logic coupled to the receiver for resynthesizing the speech signal based in 
part on the pitch lag parameters;

a lost frame detector that detects whether a frame was not received by the receiver;
and

a frame recovery logic that, when the lost frame detector detects a lost frame, uses the 
pitch lag parameters of a plurality of previously received frames to extrapolate a pitch lag 
parameter for the lost frame.

17. The speech communication system of claim 16 wherein the frame recovery logic uses
the pitch lag parameter of a frame received subsequent to the lost frame to set the pitch lag 
parameter of the lost frame. 

18. The speech communication system of claim 16 wherein the lost frame detector and/or
the frame error logic is part of the control logic. 

19. The speech communication system of claim 16 wherein when the receiver receives
the pitch lag parameter in the frame following a lost frame, the frame recovery logic uses the
pitch lag parameter of the frame following the lost frame to adjust the pitch lag parameter
previously set for the lost frame. 

20. The speech communication system of claim 19 further comprising an adaptive 
codebook buffer containing a total excitation for a first frame, the total excitation including a
quantized adaptive codebook excitation component, wherein the buffered total excitation is
extracted as an adaptive codebook excitation for the frame following the first frame and the 
frame recovery logic uses the pitch lag parameter of the frame following the lost frame to 
adjust the quantized adaptive codebook excitation. 

21. The speech communication system of claim 17 wherein the frame recovery logic
extrapolates the pitch lag parameter of the lost frame from the pitch lag parameter of a frame
received subsequent to the lost frame. 

22. A decoder for a speech communication system, the decoder comprising: 

a receiver that receives parameters of a speech signal to be decoded, the parameters
being received on a frame-by-frame basis where each frame includes a plurality of subframes
and the parameters include a gain parameter for each subframe of a frame;

a control logic coupled to the receiver for decoding the parameters and for 
resynthesizing the speech signal; 

a lost frame detector that detects whether a frame of parameters was not received by 
the receiver; and

a frame recovery logic that, when the lost frame detector detects a lost frame, sets the 
gain parameter of a subframe of the lost frame in a first manner if the lost gain parameter is 
an adaptive codebook gain parameter and in a second manner if the lost gain parameter is a 
fixed codebook gain parameter. 

23. The decoder of claim 22 wherein the frame recovery logic sets the gain parameter of
a subframe of the lost frame in a third manner if the lost frame contained periodic-like speech 
and in a fourth manner if the lost frame contained nonperiodic-like speech. 

24. The decoder of claim 22 wherein the first manner differs from the second manner.

25. The decoder of claim 23 wherein the third manner differs from the fourth manner.

26. The decoder of claim 23 further comprising a periodic signal detector that determines 
whether the speech signal is periodic wherein if the lost frame contained nonperiodic-like
speech and if the lost gain parameter is a fixed codebook gain parameter, the frame recovery
logic sets the fixed codebook gain parameter of the first subframe of the lost frame to zero. 

27. The decoder of claim 26 wherein the frame recovery logic sets the fixed codebook 
gain parameter of all of the plurality of subframes of the lost frame to zero.

28. The decoder of claim 23 further comprising a periodic signal detector that determines
whether the speech signal is periodic wherein if the lost frame contained nonperiodic-like 
speech and if the lost gain parameter is a fixed codebook gain parameter, the frame recovery 
logic sets the fixed codebook gain parameter of the first subframe of the lost frame to a value 
based on the ratio of the energy of the speech signal for a previously received frame to the 
energy of the speech signal for the lost frame. 

29. The decoder of claim 28 wherein the frame recovery logic sets the fixed codebook 
gain parameter of the remaining subframes of the lost frame to a value that decreases 
progressively from the fixed codebook gain parameter of the first subframe of the lost frame. 

30. The decoder of claim 23 wherein if the lost gain parameter is a fixed codebook gain 
parameter, the frame recovery logic sets the fixed codebook gain parameter of the first 
subframe of the lost frame to zero regardless if the lost frame contained periodic-like speech 
or nonperiodic-like speech. 

31. The decoder of claim 23 further comprising a periodic signal detector that determines
whether the speech signal is periodic wherein if the lost frame contained periodic-like speech 
and if the lost gain parameter is a fixed codebook gain parameter, the frame recovery logic 
determines whether the average adaptive codebook gain parameter of a plurality of the 
previously received frames exceeds a threshold and if the average adaptive codebook gain 
parameter exceeds the threshold, the frame recovery logic sets the fixed codebook gain 
parameter of the first subframe of the lost frame to zero. 

32. The decoder of claim 31 wherein if the average adaptive codebook gain parameter is
less than the threshold, the frame recovery logic sets the fixed codebook gain parameter of 
the first subframe of the lost frame to zero.

33. The decoder of claim 31 wherein if the average adaptive codebook gain parameter is
less than the threshold, the frame recovery logic sets the fixed codebook gain parameter of
the first subframe of the lost frame to a value based on the ratio of the energy of the speech
signal for a previously received frame to the energy of the speech signal for the lost frame. 

34. The decoder of claim 23 wherein if the current frame being processed by the decoder 
is the first frame to be lost after the decoder received a frame, the frame recovery logic sets
the adaptive gain parameter of the first subframe of the lost frame to an arbitrarily high
number. 

35. The decoder of claim 34 wherein the plurality of subframes of the lost frame is set to 
the arbitrarily high number. 

36. The decoder of claim 34 wherein the frame recovery logic sets the adaptive gain 
parameter of each of the remaining subframes of the lost frame to a value that decreases 
progressively from the adaptive gain parameter of the first subframe of the lost frame. 

37. The decoder of claim 23 further comprising a periodic signal detector that determines
whether the speech signal is periodic wherein if the lost frame contained nonperiodic-like
speech and if the lost gain parameter is an adaptive codebook gain parameter, the frame 
recovery logic determines an average adaptive codebook gain parameter for an adaptive 
number of the previously received frames. 

38. The decoder of claim 37 further comprising a periodic signal detector that determines 
whether the speech signal is periodic wherein if the lost frame contained nonperiodic-like
speech and a previously received frame contains an adaptive codebook excitation energy and
if the lost gain parameter is an adaptive codebook gain parameter, the frame recovery logic 
also determines a first value based on the ratio of the adaptive codebook excitation energy to 
the total excitation energy. 

39. The decoder of claim 38 wherein if the first value exceeds a threshold, the frame 
recovery logic sets the adaptive codebook gain parameter of the current subframe of the lost 
frame to an arbitrarily high number. 

40. The decoder of claim 38 wherein if the first value is less than a threshold, the frame 
recovery logic sets the adaptive codebook gain parameter of the current subframe of the lost 
frame to the average adaptive codebook gain parameter. 

41. The decoder of claim 39 wherein the arbitrarily high number is based on the spectral
tilt of a previously received frame. 

42. The decoder of claim 41 wherein the arbitrarily high number is based on the energy 
of the speech signal in the previously received frame. 

43. The decoder of claim 41 wherein the arbitrarily high number is based on the energy 
of the speech signal in the previously received frame and the first value. 

44. The decoder of claim 37 further comprising an onset detector which detects if a frame 
contains a speech onset signal wherein if the frame contains a speech onset signal, the frame 
recovery logic sets the adaptive codebook gain parameter of the current subframe of the lost 
frame to the lesser of the average adaptive codebook gain parameter and an arbitrarily high
number. 

45. The decoder of claim 44 wherein the arbitrarily high number is based on the spectral 
tilt of a previously received frame.

46. The decoder of claim 44 wherein the arbitrarily high number is based on the energy 
of the speech signal in the previously received frame. 

47. The decoder of claim 45 wherein a previously received frame contains an adaptive
codebook excitation energy and the arbitrarily high number is based on the energy of the 
speech signal in the previously received frame and a first value based on the ratio of the 
adaptive codebook excitation energy to the total excitation energy. 

48. The decoder of claim 1 wherein after the frame recovery logic sets the lost parameters
of the lost frame, the decoder resynthesizes the speech from the lost frame and adjusts the 
energy of the synthesized speech to match the energy of the synthesized speech from a 
previously received frame. 

49. The decoder of claim 5 wherein after the frame recovery logic sets the lost parameters
of the lost frame, the decoder resynthesizes the speech from the lost frame and adjusts the 
energy of the synthesized speech to match the energy of the synthesized speech from a 
previously received frame. 

50. The decoder of claim 11 wherein after the frame recovery logic sets the lost
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

51. The speech communication system of claim 16 wherein after the frame recovery logic
sets the lost parameters of the lost frame, the decoder resynthesizes the speech from the lost 
frame and adjusts the energy of the synthesized speech to match the energy of the 
synthesized speech from a previously received frame. 

52. The speech communication system of claim 17 wherein after the frame recovery logic 
sets the lost parameters of the lost frame, the decoder resynthesizes the speech from the lost 
frame and adjusts the energy of the synthesized speech to match the energy of the 
synthesized speech from a previously received frame. 

53. The speech communication system of claim 18 wherein after the frame recovery logic
sets the lost parameters of the lost frame, the decoder resynthesizes the speech from the lost 
frame and adjusts the energy of the synthesized speech to match the energy of the 
synthesized speech from a previously received frame. 

54. The decoder of claim 22 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and 
adjusts the energy of the synthesized speech to match the energy of the synthesized speech 
from a previously received frame. 

55. The decoder of claim 26 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and 
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame.

56. The decoder of claim 28 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

57. The decoder of claim 30 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

58. The decoder of claim 31 wherein after the frame recovery logic sets the lost
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

59. The decoder of claim 33 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

60. The decoder of claim 37 wherein after the frame recovery logic sets the lost 
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame. 

61. The decoder of claim 44 wherein after the frame recovery logic sets the lost
parameters of the lost frame, the decoder resynthesizes the speech from the lost frame and
adjusts the energy of the synthesized speech to match the energy of the synthesized speech
from a previously received frame.

62. A method for generating a fixed codebook excitation for a frame of speech in a 
speech communication system comprising the steps of: 

providing a Gaussian time series generator; 
providing a first frame containing characteristics of a first speech signal;

using the characteristics of the first speech signal in the first frame to derive a first 
seed value; 

providing the first seed value to the Gaussian time series generator; 

using the first seed value to generate a fixed codebook excitation for the first frame; and

transmitting the characteristics of the first speech signal. 

63. The method of claim 62 further comprising the steps of: 

providing a second frame containing characteristics of a second speech signal; 
using the characteristics of the second speech signal in the second frame to derive a
second seed value that is different than the first seed value;

providing the second seed value to the Gaussian time series generator; 

using the second seed value to generate a fixed codebook excitation for the second 
frame; and 

transmitting the characteristics of the second speech signal.

64. The method of claim 62 wherein the step of providing a first frame is accomplished in
an encoder that does not transmit a fixed codebook excitation. 

65. The method of claim 62 wherein the step of providing a first frame is accomplished in 
a decoder that does not receive a fixed codebook excitation by receiving information about 
the characteristics of the speech signal in the first frame. 

66. The method of claim 62 further comprising the steps of: 

receiving the characteristics of the first speech signal for the first frame; 

using the characteristics of the first speech signal to derive the first seed value; 

providing the first seed value to the Gaussian time series generator; and 

using the first seed value to generate a fixed codebook excitation for the first frame. 

67. The method of claim 63 further comprising the steps of: 

receiving the characteristics of the second speech signal for the second frame; 
using the characteristics of the second speech signal to derive the second seed value 
that is different than the first seed value; 

providing the second seed value to the Gaussian time series generator; and 

using the second seed value to generate a fixed codebook excitation for the second frame.

68. The method of claim 62 wherein the steps are performed by an encoder. 

69. The method of claim 66 wherein the steps are performed by a decoder. 

70. A method of coding or decoding speech in a communication system comprising the 
steps of:

(a) providing a speech signal on a frame-by-frame basis where each frame 
includes a plurality of subframes; 

(b) determining a parameter for each frame based on the speech signal; 

(c) transmitting parameters on a frame-by-frame basis; 

(d) receiving parameters on a frame-by-frame basis; 

(e) detecting whether a frame containing the parameter is lost; 

(f) handling the lost parameter for the lost frame if a frame was lost; 

(g) decoding the parameters to reproduce the speech signal. 

71. The method of claim 70 wherein the lost parameter represents the minimum spacing
of a line spectral frequency for the lost frame. 

72. The method of claim 71 wherein the handling step sets the minimum spacing 
parameter for the lost frame to a first value which is greater than or equal to the minimum 
spacing parameter for the previously received frame. 

73. The method of claim 72 wherein the handling step sets the minimum spacing 
parameter for the frame received after the lost frame to a second value, the second value 
being greater than or equal to the minimum spacing parameter for the frame received 
immediately before the lost frame and less than or equal to the minimum spacing parameter 
for the lost frame. 

74. The method of claim 72 wherein the first value is based at least in part on the 
frequency spectrum of the speech signal. 

75. The method of claim 72 wherein the first value is based at least in part on the energy
of the speech signal. 

76. The method of claim 71 wherein the lost parameter is a pitch lag parameter for the 
lost frame and the handling step sets the lost pitch lag parameter for the lost frame based at
least in part on the pitch lag parameter of a previously received frame.

77. The method of claim 76 wherein the handling step sets the lost pitch lag parameter of 
the lost frame based on the pitch lag parameters of a plurality of previously received frames. 

78. The method of claim 76 wherein the handling step sets the lost pitch lag parameter of 
the lost frame based on the pitch lag parameter of a frame received subsequent to the lost 
frame. 

79. The method of claim 70 further comprising the step of determining whether the
speech signal is periodic-like or nonperiodic-like and wherein the lost parameter is a gain 
parameter for a subframe of the lost frame. 

80. The method of claim 79 wherein the handling step sets the lost gain parameter of a 
subframe of the lost frame containing periodic-like speech differently than the step sets the
lost gain parameter of a subframe of the lost frame containing nonperiodic-like speech.

81. The method of claim 79 wherein if the lost frame contained nonperiodic-like speech
and if the lost gain parameter is a fixed codebook gain parameter, the handling step sets the
fixed codebook gain parameter of the first subframe of the lost frame to zero.

82. The method of claim 81 wherein the handling step sets the fixed codebook gain
parameter of all of the plurality of subframes of the lost frame to zero.

83. The method of claim 79 wherein if the lost frame contained nonperiodic-like speech 
and if the lost gain parameter is a fixed codebook gain parameter, the handling step sets the
fixed codebook gain parameter of the first subframe of the lost frame to a value based on the
ratio of the energy of the speech signal for a previously received frame to the energy of the 
speech signal for the lost frame. 

84. The method of claim 83 wherein the handling step sets the fixed codebook gain
parameter of the remaining subframes of the lost frame to a value that decreases 
progressively from the fixed codebook gain parameter of the first subframe of the lost frame. 

85. The method of claim 79 wherein if the lost gain parameter is a fixed codebook gain
parameter, the handling step sets the fixed codebook gain parameter of the first subframe of
the lost frame to zero regardless of whether the lost frame contained periodic-like speech or
nonperiodic-like speech. 

86. The method of claim 79 wherein if the lost frame contained periodic-like speech and 
if the lost gain parameter is a fixed codebook gain parameter, the handling step determines
whether the average adaptive codebook gain parameter of a plurality of the previously
received frames exceeds a threshold and if the average adaptive codebook gain parameter 
exceeds the threshold, the handling step sets the fixed codebook gain parameter of the first 
subframe of the lost frame to zero. 

87. The method of claim 86 wherein if the average adaptive codebook gain parameter is 
less than the threshold, the handling step sets the fixed codebook gain parameter of the first
subframe of the lost frame to zero. 

88. The method of claim 86 wherein if the average adaptive codebook gain parameter is 
less than the threshold, the handling step sets the fixed codebook gain parameter of the first
subframe of the lost frame to a value based on the ratio of the energy of the speech signal for 
a previously received frame to the energy of the speech signal for the lost frame. 

89. The method of claim 79 wherein if the current frame received is the first frame lost 
after the receipt of a frame and if the lost gain parameter is an adaptive codebook gain
parameter of the lost frame, the handling step sets the adaptive gain parameter of the first 
subframe of the lost frame to an arbitrarily high number. 

90. The method of claim 89 wherein the plurality of subframes of the lost frame is set to 
the arbitrarily high number.

91. The method of claim 79 wherein if the lost frame contained nonperiodic-like speech
and if the lost gain parameter is an adaptive codebook gain parameter of the lost frame, the
handling step determines an average adaptive codebook gain parameter for an adaptive
number of the previously received frames.

92. The method of claim 91 wherein if the lost frame contained nonperiodic-like speech 
and a previously received frame contains an adaptive codebook excitation energy, the 
handling step determines a first value based on the ratio of the adaptive codebook excitation
energy to the total excitation energy.

93. The method of claim 92 wherein if the first value exceeds a threshold, the handling
step sets the adaptive codebook gain parameter of the current subframe of the lost frame to an 
arbitrarily high number. 

94. The method of claim 92 wherein if the first value is less than a threshold, the handling
step sets the adaptive codebook gain parameter of the current subframe of the lost frame to 
the average adaptive codebook gain parameter. 

95. The method of claim 93 wherein the arbitrarily high number is based on the spectral 
tilt of a previously received frame, the energy of the speech signal in the previously received
frame, and/or the first value. 

96. The method of claim 89 further comprising an onset detector which detects if a frame 
contains a speech onset signal wherein if the frame contains a speech onset signal, the
handling step sets the adaptive codebook gain parameter of the current subframe of the lost
frame to the lesser of the average adaptive codebook gain parameter and an arbitrarily high 
number. 

97. The method of claim 71 further comprising the steps of: 

resynthesizing the speech from the lost frame after the handling step sets the lost
parameter of the lost frame; and 

adjusting the energy of the synthesized speech to match the energy of the synthesized 
speech from a previously received frame. 

98. The method of claim 76 further comprising the steps of:

resynthesizing the speech from the lost frame after the handling step sets the lost 
parameter of the lost frame; and

adjusting the energy of the synthesized speech to match the energy of the synthesized 
speech from a previously received frame. 

99. The method of claim 79 further comprising the steps of: 

resynthesizing the speech from the lost frame after the handling step sets the lost 
parameter of the lost frame; and 

adjusting the energy of the synthesized speech to match the energy of the synthesized 
speech from a previously received frame. 

100. The decoder of claim 22 wherein the lost frame detector or the frame error logic is 
part of the control logic. 

101. The decoder of claim 22 wherein the lost frame detector and the frame error logic are 
part of the control logic. 
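
The seed mechanism of claims 62-69 lets an encoder and a decoder build the same fixed codebook excitation without transmitting it: both sides derive the seed from the same transmitted frame characteristics. The Python sketch below assumes, hypothetically, that those characteristics are available as a list of quantized parameter indices and that the seed is a simple polynomial hash of them. `derive_seed` and `gaussian_excitation` are illustrative names, and Python's `random.Random(seed).gauss` merely stands in for the Gaussian time series generator.

```python
import random
from typing import List, Sequence


def derive_seed(param_indices: Sequence[int]) -> int:
    """Hypothetical seed derivation: fold the frame's transmitted
    parameter indices into a 32-bit integer with a polynomial hash."""
    seed = 0
    for idx in param_indices:
        seed = (seed * 31 + idx) & 0xFFFFFFFF
    return seed


def gaussian_excitation(seed: int, length: int) -> List[float]:
    """Fixed codebook excitation drawn from a seeded Gaussian time series.
    The same seed always reproduces the same sequence."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


# Encoder: derive the seed from the first frame's characteristics (claim 62).
frame1 = [12, 7, 33, 5]
encoder_exc = gaussian_excitation(derive_seed(frame1), 40)

# Decoder: the received characteristics yield the same seed, hence the same
# excitation, even though the excitation itself was never sent (claims 65-66).
decoder_exc = gaussian_excitation(derive_seed(frame1), 40)
print(encoder_exc == decoder_exc)  # True

# A second frame with different characteristics gets a different seed (claim 63).
frame2 = [12, 7, 33, 6]
print(derive_seed(frame1) != derive_seed(frame2))  # True
```

A hash like this can still collide for some pairs of frames, and a real codec would define the seed derivation and generator bit-exactly so that encoder and decoder agree on every platform; Python's `random` module is used here only for brevity.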
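
Claims 71-75 widen the minimum line spectral frequency (LSF) spacing for a lost frame and ease it back for the frame that follows. A minimal sketch, assuming LSFs in Hz held in an ascending list. The widening factor and the midpoint rule for the recovery frame are hypothetical choices that merely satisfy the inequalities of claims 72 and 73; the dependence of the first value on the spectrum and energy of the speech (claims 74-75) is not modeled.

```python
from typing import List


def lost_frame_spacing(prev_spacing: float, widen: float = 1.5) -> float:
    """First value (claim 72): never smaller than the previous frame's spacing.
    The widening factor is a hypothetical tuning constant."""
    return prev_spacing * max(widen, 1.0)


def recovery_frame_spacing(prev_spacing: float, lost_spacing: float) -> float:
    """Second value (claim 73): between the pre-loss spacing and the lost-frame
    spacing. The midpoint is one hypothetical choice inside that range."""
    return 0.5 * (prev_spacing + lost_spacing)


def enforce_min_spacing(lsf: List[float], min_gap: float) -> List[float]:
    """Push each LSF upward so adjacent values are at least min_gap apart."""
    out = list(lsf)
    for i in range(1, len(out)):
        if out[i] - out[i - 1] < min_gap:
            out[i] = out[i - 1] + min_gap
    return out


prev = 50.0                                 # spacing in the last good frame (Hz)
lost = lost_frame_spacing(prev)             # 75.0
after = recovery_frame_spacing(prev, lost)  # 62.5
print(enforce_min_spacing([300.0, 320.0, 900.0], lost))  # [300.0, 375.0, 900.0]
```

This single upward pass can push the highest LSFs past the Nyquist frequency when the required gap is large; a practical quantizer would also bound the top of the vector.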
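
Claims 76-78 recover a lost pitch lag from the lags of earlier frames or, when a later frame has already arrived, from that frame's lag as well. A sketch under hypothetical rules: interpolate halfway toward the next frame's lag when it is known, otherwise continue the most recent lag trend, clamping the step so that one noisy lag cannot cause a large jump. The function name and `max_step` parameter are illustrative.

```python
from typing import Optional, Sequence


def conceal_pitch_lag(prev_lags: Sequence[float],
                      next_lag: Optional[float] = None,
                      max_step: float = 3.0) -> float:
    """Estimate the pitch lag of a lost frame (hypothetical rules).

    prev_lags: lags of previously received frames, oldest first (non-empty).
    next_lag:  lag of a frame received after the lost one, if available.
    """
    last = float(prev_lags[-1])
    if next_lag is not None:
        # Claim 78: use the subsequently received frame; midpoint interpolation.
        return 0.5 * (last + next_lag)
    if len(prev_lags) >= 2:
        # Claim 77: use several previous frames; continue the last trend,
        # limited to +/- max_step samples per frame.
        step = last - float(prev_lags[-2])
        step = max(-max_step, min(max_step, step))
        return last + step
    # Claim 76: only one previous lag is known, so repeat it.
    return last


print(conceal_pitch_lag([40, 42, 44]))           # 46.0
print(conceal_pitch_lag([40, 42], next_lag=50))  # 46.0
print(conceal_pitch_lag([40, 60]))               # 63.0
```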
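
The gain rules of claims 79-96, and their decoder counterparts in claims 34-47, branch on whether the lost frame held periodic-like speech. The sketch below picks one consistent path through those alternatives. All thresholds, the "arbitrarily high" adaptive gain value, the decay factor, the square-root mapping from the energy ratio, and the choice to test for an onset before the first-loss rule are hypothetical placeholders, not values or orderings taken from this specification.

```python
import math
from typing import List, Sequence


def conceal_fixed_gains(periodic: bool, avg_adaptive_gain: float,
                        prev_energy: float, lost_energy: float,
                        prev_fixed_gain: float, n_sub: int = 4,
                        ga_threshold: float = 0.6,
                        decay: float = 0.8) -> List[float]:
    """Fixed codebook gains for each subframe of a lost frame."""
    if periodic and avg_adaptive_gain > ga_threshold:
        # Claim 86: strongly voiced history, so drop the fixed excitation.
        return [0.0] * n_sub
    # Claims 83-84 and 88: first subframe from the energy ratio, then a
    # progressive decrease. The sqrt mapping and cap at 1 are assumptions.
    ratio = prev_energy / lost_energy if lost_energy > 0 else 0.0
    gain = prev_fixed_gain * math.sqrt(min(ratio, 1.0))
    gains = []
    for _ in range(n_sub):
        gains.append(gain)
        gain *= decay
    return gains


def conceal_adaptive_gain(first_loss: bool, periodic: bool, onset: bool,
                          past_adaptive_gains: Sequence[float],
                          adaptive_energy_ratio: float,
                          high_gain: float = 0.95,
                          ratio_threshold: float = 0.5) -> float:
    """Adaptive codebook gain for the current subframe of a lost frame.

    past_adaptive_gains covers the (adaptively chosen) window of previous
    frames; adaptive_energy_ratio is adaptive / total excitation energy.
    """
    avg = sum(past_adaptive_gains) / len(past_adaptive_gains)
    if onset:
        # Claim 96: after an onset, take the lesser of the average and high value.
        return min(avg, high_gain)
    if first_loss:
        # Claim 89: first frame lost after a good frame.
        return high_gain
    if not periodic:
        # Claims 91-94: high gain only if the adaptive excitation dominated.
        return high_gain if adaptive_energy_ratio > ratio_threshold else avg
    # Periodic speech after the first loss: hypothetical fallback.
    return avg


print(conceal_fixed_gains(True, 0.9, 1.0, 1.0, 2.0))  # [0.0, 0.0, 0.0, 0.0]
print(conceal_fixed_gains(False, 0.2, 1.0, 4.0, 2.0, n_sub=3, decay=0.5))  # [1.0, 0.5, 0.25]
print(conceal_adaptive_gain(False, False, False, [0.25, 0.75], 0.2))  # 0.5
```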
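
Claims 48-61 and 97-99 rescale the speech resynthesized for a lost frame so that its energy matches the synthesized speech of a previously received frame. A minimal sketch, assuming energy is measured as mean square per sample (so frames of different lengths compare fairly) and that one gain is applied to the whole frame; `match_energy` is an illustrative name.

```python
import math
from typing import List, Sequence


def match_energy(synth: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Scale `synth` so its mean-square energy equals that of `reference`.
    Both sequences must be non-empty."""
    e_synth = sum(x * x for x in synth) / len(synth)
    e_ref = sum(x * x for x in reference) / len(reference)
    if e_synth == 0.0:
        return list(synth)  # silence cannot be scaled up; leave it unchanged
    scale = math.sqrt(e_ref / e_synth)
    return [x * scale for x in synth]


print(match_energy([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0]))
# [2.0, -2.0, 2.0, -2.0]
```

Applying a single gain to the whole frame can create an audible step at the frame boundary; interpolating the gain across subframes is a common refinement.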